I. Basis of the report

1. This report has been drawn on the basis of (substitute sheets which have been furnished to the receiving Office in
response to an invitation under Article 14 are referred to in this report as "originally filed" and are not annexed to
the report since they do not contain amendments (Rules 70.16 and 70.17).):

Description, pages:

1, 6, 11, 20, 22-24, 32, 34 as originally filed

2-5, 7-10, 12-19, 21, 25-31, 33 as received on 23/02/2001 with letter of 19/02/2001

Claims, No.:

1-12 as received on 23/02/2001 with letter of 19/02/2001

Drawings, sheets:

1/13-4/13, 6/13-13/13 as originally filed

5/13 as received on 09/10/2000 with letter of 04/10/2000

2. With regard to the language, all the elements marked above were available or furnished to this Authority in the
language in which the international application was filed, unless otherwise indicated under this item.

These elements were available or furnished to this Authority in the following language: , which is:

□ the language of a translation furnished for the purposes of the international search (under Rule 23.1(b)).

□ the language of publication of the international application (under Rule 48.3(b)).

□ the language of a translation furnished for the purposes of international preliminary examination (under Rule
55.2 and/or 55.3).

3. With regard to any nucleotide and/or amino acid sequence disclosed in the international application, the
international preliminary examination was carried out on the basis of the sequence listing:

□ contained in the international application in written form.

□ filed together with the international application in computer readable form.

□ furnished subsequently to this Authority in written form.

□ furnished subsequently to this Authority in computer readable form.

□ The statement that the subsequently furnished written sequence listing does not go beyond the disclosure in
the international application as filed has been furnished.

□ The statement that the information recorded in computer readable form is identical to the written sequence
listing has been furnished.

4. The amendments have resulted in the cancellation of: 

□ the description, pages: 

□ the claims, Nos.: 

□ the drawings, sheets: 

5. □ This report has been established as if (some of) the amendments had not been made, since they have been
considered to go beyond the disclosure as filed (Rule 70.2(c)):

(Any replacement sheet containing such amendments must be referred to under item 1 and annexed to this
report.)



6. Additional observations, if necessary: 



V. Reasoned statement under Article 35(2) with regard to novelty, inventive step or industrial applicability; 
citations and explanations supporting such statement 

1. Statement 

Novelty (N) Yes: Claims 1-12 

No: Claims 

Inventive step (IS) Yes: Claims 1-12 

No: Claims 

Industrial applicability (IA) Yes: Claims 1-12

No: Claims 



2. Citations and explanations 
see separate sheet 



VII. Certain defects in the International application 

The following defects in the form or contents of the international application have been noted: 
see separate sheet 



VIII. Certain observations on the international application

The following observations on the clarity of the claims, description, and drawings or on the question whether the
claims are fully supported by the description, are made:
see separate sheet 
Re Item V

Reasoned statement under Article 35(2) with regard to novelty, inventive step or 
industrial applicability; citations and explanations supporting such statement 

1. Reference is made to the following documents cited in the international search 
report: 



D1: EP-A-0 849 916 (IBM), 24 June 1998 (1998-06-24); 

D2: MCKEOWN N ET AL: 'TINY TERA: A PACKET SWITCH CORE' IEEE
MICRO, US, IEEE INC. NEW YORK, vol. 17, no. 1, 1 January 1997
(1997-01-01), pages 26-33, XP000642693 ISSN: 0272-1732.

2. The subject-matter of independent claims 1 and 7 is considered to involve an 
inventive activity for the following reasons: 

Basically, document D2 discloses a method of transmitting information through a
data switching apparatus, as defined in the preamble of claim 1. Note that the
term "port" defined in D2 is considered to be a router according to one possible
realisation of the switch disclosed in document D2 (page 28, right-hand column,
lines 13-15: "a different processor could [...] implement IP routing"), and that the
routers maintain virtual output queues.

The subject-matter of claim 1 particularly differs from the method disclosed in
document D2 in that, in the present claim 1, the switch controller schedules the
passage of the cells across the switch fabric by using a first arbitration process to
select an input router to connect to an output router and in that the selected input
router performs a second arbitration process to select a virtual output queue
according to the output router.

The problem to be solved by the present invention, according to the above
mentioned differences, can be considered as how to select the virtual input queue
to transmit a cell through the data switching apparatus.

The solution disclosed in document D2 consists in the selection of the virtual
output queue directly by the switch controller since it indicates directly to the input
router which queue it has to select (page 28, left-hand column, lines 14-17: "when
the scheduler directs the port processor to read from a particular input queue"); it
is basically different from claim 1, where the selection of the virtual output queue
is decided by the input router. This feature is not obviously derivable from D2.

Moreover, document D1 does not disclose such a solution. The data switching
apparatus is different: it does not have routers; the switch core access layers only
multiplex or serialize the information they receive, which is different from routing.

Therefore, starting from document D2, a person skilled in the art trying to solve
the above mentioned problem will not arrive, by considering the document D2
alone or in combination with document D1, at the method disclosed in claim 1
without involving an inventive activity.

Likewise, since claim 7 discloses the means corresponding to the steps of the
method claim 1, claim 7 is considered, for reasons similar to those mentioned for
claim 1, to involve an inventive activity.

3. Consequently, claims 2-6 and 8-12 depending respectively on claims 1 and 7 are 
also considered to involve an inventive activity. 

Re Item VII 

Certain defects in the international application

1. The reference signs used for the "switch fabric" are not consistent in claims 1 and 7:
claim 1 uses "SCM" while claim 7 uses "SCM" or "SP" (Rule 11.13(m) PCT).

2. The final punctuation of claims 4, 6 and 10 is not correct. 
Re Item VIII 

Certain observations on the international application 

In claim 1, the term "the data switch" (page 35, line 4) is ambiguous. It is however
obviously used instead of the term "the data switching apparatus".
which means that the storage devices must be very high performance. Hence, at
very high data rates current technology limits the use of

It is an object of the invention to provide a data switching method and
apparatus for the more efficient handling of packets of information through a data
switch.

According to a first aspect of the invention there is provided a method
of handling packets of information through a data switch comprising input traffic
controllers, ingress routers, a memoryless cyclic switch fabric, egress routers and
output traffic controllers all under the control of a switch controller and
interconnected such that each input line connected to the data switch is terminated
on a traffic controller arranged to convert the input line protocol information
packets into fixed length cells having a header defining the data switch destination
router and output traffic controller together with message priority information
arranged such that each ingress router serves a group of traffic controllers
characterised in that the ingress router includes a set of input buffers one for each
input line and a set of virtual output queue buffers, one for each output traffic
controller from the data switch, and in which the method comprises on the arrival
of a cell from a traffic controller the ingress router examines the cell header and
places it in the appropriate virtual output queue and generates a request for transfer
message consisting of the destination traffic controller address and a message
priority code which is passed to the data switch controller, the switch controller
schedules the passage of the cells across the switch fabric by interconnecting a
specific ingress router to a specific egress router for each switch fabric cycle in
accordance with a first arbitration process the ingress router selecting from the
appropriate virtual output queue the cell at the head of the queue for passage across
The article "Tiny Tera: a packet switch core", IEEE Micro, US, IEEE Inc. New
York, vol. 17, no. 1, 1997, p26-33, discloses a data switch having a crossbar interface
controlled by a scheduler. The input to the switch includes a number of "port cards".
Each port card comprises a number of data slices, which each generate a 64-bit section
("chunk") of a message. On each port card there is a single port processor, which
determines where each data slice should store each chunk. Upon receipt of a packet of data,
the port processor informs a scheduler of newly arrived packets. The scheduler controls
the crossbar interface to connect the data slices to outputs of the switch.

It is an object of the invention to provide a data switching method and apparatus
for a more efficient handling of packets of information through a data switch.

According to a first aspect of the invention there is provided a method according
to claim 1.

According to a second aspect of the invention there is provided a data switch 
according to claim 7. 

Preferably, each ingress router maintains an input buffer for each of the group of 
input traffic controllers from which it receives signals 
the data switch to the appropriate output traffic controller in accordance with a

According to a second aspect of the invention there is provided a data
switch for handling packets of information comprising input traffic controllers,
ingress routers, a memoryless cyclic switch fabric, egress routers and output traffic
controllers all under the control of a switch controller and interconnected such that
each input line connected to the data switch is terminated on a traffic controller
arranged to convert the input line protocol information packets into fixed length
cells having a header defining the data switch destination router and output traffic
controller together with message priority information arranged such that each
ingress router serves a group of traffic controllers characterised in that the ingress
router includes a set of input buffers one for each input line and a set of virtual
output queue buffers, one for each output traffic controller connected to the data
switch, and in which on the arrival of a cell from a traffic controller the ingress
router examines the cell header and places it in the appropriate virtual output queue
and generates a request for transfer message consisting of the destination traffic
controller address and a message priority code which is passed to the data switch
controller, the switch controller schedules the passage of the cells across the switch
fabric by interconnecting a specific ingress router to a specific egress router for
each switch fabric cycle in accordance with a first arbitration process and the
ingress router selects from the appropriate virtual output queue the cell at the head
of the queue for passage across the data switch to the appropriate output traffic
controller in accordance with a second arbitration process.

The invention together with its various features will be more readily
understood from the following description which should be read in conjunction
with the accompanying drawings, in which:-
Fig. 1 shows a generalised concept of the prior art.

Fig. 2 shows in block diagram form one embodiment of the data switch
of the invention.

Fig. 3 shows the switch fabric of the embodiment of the invention.

Fig. 4 shows the flow of data through the switch fabric.

Fig. 5 shows ATM frame headers when passing through the switch
fabric,

Fig. 6 shows Ethernet frame headers when passing through the switch
fabric,

Fig. 7 shows the scheduling and arbitration arrangements of the data
switch.

Fig. 8 shows an egress backpressure broadcast,

Fig. 9 shows the switch block diagram.

Fig. 10 shows the detail of the switch block.

Fig. 11 shows a block diagram of the master according to the
embodiment of the invention.

Fig. 12 shows a block diagram of a router according to the
embodiment of the invention, whilst

Fig. 13 shows the queue structure.

Referring now to Figure 1, this shows the general concept of a data
switch. Inputs N1 to Nn are connected to respective input ports IP1 to IPn of a data
switch SW. The switch has output ports OP1 to OPn connected to respective
outputs M1 to Mn.

With intelligent distributed scheduling mechanisms it is possible to
create an input buffered switch which meets the same traffic shaping efficiency of
its output buffered counterpart. The use of input buffers is preferred for several
reasons. Input buffering requires smaller buffers, which can be of lower
performance and therefore be cheaper.

When cells are queued at the input there is the possibility of
contention arising through the phenomenon of Head Of Line (HOL) blocking. This
generally occurs when First In First Out (FIFO) queue mechanisms are used. The
FIFO queues the cell at the head of the queue and this is the only one that can be
chosen for delivery through the switch. Now, consider the case where an input port
has three cells c1, c2, c3 stored such that c1 is at the head of the queue with c2
stored next and c3 last with cell c1 destined for port N and cell c2 destined for port
N+1. Now port N is already connected to port N-1 therefore c1 cannot be
switched, however port N+1 is unconnected and therefore c2 could actually be
delivered. However, c2 cannot get out of the FIFO because it is blocked by the
HOL i.e. c1. An intelligent approach to the solution of HOL blocking is the
concept of Virtual Output Queues (VOQ). Using VOQs the cells are separated out
at the input into queues which map directly to their required output destination.
They can therefore be effectively described as being output queues, which are held
at the input i.e. Virtual Output Queues. Since the cells are now separated out in
terms of their output destination they can no longer be blocked by the HOL
phenomenon.
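
The c1/c2/c3 scenario above can be sketched in a few lines of Python. This is only an illustration of the queueing idea, not the patented implementation; the function names, the numeric value chosen for port N and the destination of c3 (which the text does not state) are assumptions.

```python
from collections import deque

def fifo_deliverable(queue, busy_outputs):
    """Single FIFO: only the head cell may be sent; None if its output is busy."""
    if not queue:
        return None
    cell, dest = queue[0]
    return None if dest in busy_outputs else cell

def voq_deliverable(voqs, busy_outputs):
    """Virtual output queues: the head of every queue whose output is free may be sent."""
    return [q[0] for dest, q in sorted(voqs.items()) if q and dest not in busy_outputs]

N = 4                                             # hypothetical port number
cells = [("c1", N), ("c2", N + 1), ("c3", N)]     # c3's destination is assumed
busy = {N}                                        # port N is already connected

fifo = deque(cells)
voqs = {}
for name, dest in cells:
    voqs.setdefault(dest, deque()).append(name)

print(fifo_deliverable(fifo, busy))   # head cell c1 is blocked, so nothing is sent
print(voq_deliverable(voqs, busy))    # c2 is no longer stuck behind c1
```

With the single FIFO nothing can be delivered while port N is busy, whereas the per-destination queues expose c2 for delivery to port N+1.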

There is also the question of Quality Of Service (QoS) to address.
Different input sources have different requirements in terms of how their data
should be delivered. For example voice data must be guaranteed to a very tightly
controlled delivery service whereas the handling of computer data can be more
relaxed. To accommodate these requirements the concept of priority can be used.
Data is given a level of priority, which changes the way the switch deals with it.
For example consider two cells in different VOQs c1 and c2 which are both
Referring now to Figure 2, the main feature is the data switch SW.
Inputs are provided to the switch from ingress traffic manager units ITM0 to ITMp.
Each ingress traffic manager may have one or more input line end devices (ILE)
connected to it. Outputs from the switch SW are connected by way of egress traffic
manager units ETM0 to ETMq to egress line end devices (ELE).



The traffic manager units (ITM and ETM) provide protocol-specific
processing in the switch, such as congestion buffering, ingress traffic
policing, address translation (ingress and egress) and routing (ingress), traffic
shaping (ingress or egress), collection of traffic statistics and line level diagnostics.
There may also be some segmentation and re-assembly functionality within a
traffic manager unit. The line end devices (ILE and ELE) are full-duplex devices
and provide the switch port physical interfaces. Typically, line end devices will be
operated in synchronous transfer mode, ranging from OC-3 to OC-48 rates or
10/100 and Gigabit Ethernet.

The switch SW provides the application independent, loss-less
transport of data between the traffic managers based on routing information
provided by the traffic managers and the connection allocation policies determined
by the switch control SC. This controls the global functions of the switch such as
connection management, switch level diagnostics, statistics collection and
redundancy management.

The switching system just described is based on an input-queued
non-blocking crossbar architecture. A combination of adequate buffering, hierarchic
flow control, and distributed scheduling and arbitration processes ensure loss-less,
efficient, and high performance switching capabilities. It should be noted that the
ingress and egress functions are shown separately on either side of the drawing.
In reality, traffic manager units, line-end devices, ingress and egress ports may be
considered full duplex.

Fig. 3 shows the basic architecture of the switch according to one
embodiment of the invention. The ingress traffic manager units ITM described
above connect streams of data to a number of ingress routers SRI0 to SRIp. These
routers are connected to the switching matrix SCM, which is itself controlled by
a switch controller SM. Data outputs from the switching matrix SCM are passed
by egress routers SRE0 to SREp and on to the egress traffic manager units ETM.

The ingress routers SRI0 to SRIp on the ingress side collect data
streams from the ingress traffic manager units ITM, request connections across the
switching matrix SCM to the controller SM, queue up data packets (referred to as
'tensors') until the controller SM grants a connection and then send the data to the
switching matrix SCM. On the egress side, the egress routers SRE0 to SREp sort
data packets into the relevant data streams and forward them to the appropriate
egress traffic manager units ETM. Each ingress and egress router communicates
on a point to point basis with two traffic manager units over a common switch
interface. Each interface is 32 bits wide (full duplex) and can operate at either 50
or 100MHz. Through its common interfaces the routers can support up to 5Gbs of
cell-based traffic such as ATM or 4Gbs of packet based traffic such as Gigabit
Ethernet. These 4 or 5Gbs of data share a small amount of external memory.

The switch controller SM takes connection requests from the ingress
routers and creates sets of connections in the switching matrix SCM. The
controller SM arbitration mechanisms maximise the efficiency of the switching
matrix SCM while maintaining fairness of service to the routers. The controller SM is able
to configure one-to-one (unicast) and one-to-all (broadcast) connections in the
switching matrix SCM. The controller SM selects an optimal combination of
connections to establish in the matrix SCM once per switching cycle. The selection
can be postponed by one (or more) backpressure broadcast requests that are
satisfied in a round-robin fashion before allowing normal operation to resume. The
arbiter also uses a probabilistic work-conserving algorithm to allocate bandwidth
in the switching matrix to each priority according to information defined by the
external system controller.

The switching matrix SCM itself consists of a number of memory-less,
non-blocking matrix planes SCM1 - N and a number of embedded serial
transceivers to interface to the routers. The number of matrix planes in a particular
switch depends on the core throughput required across the matrix. The core
throughput will be greater than the aggregate of the external interfaces to allow for
inter-router communication, core header overheads and maximal connections
during the arbitration cycles. The device is packaged with two planes of sixteen
ports, which can be configured to provide an alternative number of planes/ports.
The multiple serial links that comprise the data path between the router and
switching matrix are switched simultaneously and therefore act as a single full
duplex fat pipe of 8Gbps. The switching matrix has the feature that it can be
configured as an NxN port crossbar device where N can be 4, 8 or 16. This feature
can increase the number of planes per package and therefore allows a wide range
of systems to be realised cost effectively. For example using the first generation
chip set systems of less than 20Gbs up to 80Gbs can be easily configured.

Underlying the management of the system is the fabric management
interface FMI, which provides an external orthogonal interface into all of the
system devices. This level of management provides read/write access to a chosen
subset of important registers and RAMs while the device is functioning normally,
and will provide access to all the registers and RAMs in a device (using the scan
mechanism) if the system is inoperable. Management access can be used for the
purposes of system initialisation, and dynamic reconfiguration. The following
features need to be configured via the fabric management interface FMI at system
reset but can be modified on a live system: ingress router queue parameter
sensitivities, ingress and egress queue thresholds, bandwidth allocation tables in the
routers and switch controller and status information. Each device has a primary
status register, which can be read to obtain the high-level view of the device status;
for example, the detection of non-critical failures. If necessary, more detailed
status registers can then be accessed.

If a device or the total system fails, fabric interface management access into
the chip-set is still possible. This will normally provide useful information in
diagnosing the fault. It can also be used to perform low-level testing of the
hardware.

Detailed error management facilities have been built into the system. The
management of errors can be considered under the headings of detection,
correction, containment and reporting as described below:-

a) Detection. Within the system, all interfaces between devices are checked
as follows:- parallel interfaces between devices are protected by parity. Serial data
being routed from one router to another via the switching matrix is protected by a
sixteen bit cyclic redundancy code. This is generated in the ingress router and
forms part of the tensor. It is checked and discarded at the egress router. All
external interfaces support parity and the common switch interface specification
includes optional parity. This is implemented at the system end of the interface.
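
The generate-at-ingress, check-and-discard-at-egress handling of the sixteen bit cyclic redundancy code can be sketched as follows. The text does not name the polynomial, the initial value or the position of the code inside the tensor, so the CRC-CCITT routine from Python's standard library, the 0xFFFF seed and the trailing two-byte placement are all assumptions made purely for illustration.

```python
import binascii

CRC_INIT = 0xFFFF  # assumed seed; the source does not specify one

def ingress_append_crc(payload: bytes) -> bytes:
    """Ingress router side: compute a 16-bit CRC and append it to the data."""
    crc = binascii.crc_hqx(payload, CRC_INIT)   # CRC-CCITT, an assumed choice
    return payload + crc.to_bytes(2, "big")

def egress_check_and_strip(tensor: bytes) -> bytes:
    """Egress router side: verify the CRC, then discard it and return the data."""
    payload, received = tensor[:-2], int.from_bytes(tensor[-2:], "big")
    if binascii.crc_hqx(payload, CRC_INIT) != received:
        raise ValueError("CRC mismatch: data corrupted between routers")
    return payload

protected = ingress_append_crc(b"example payload")
print(egress_check_and_strip(protected))
```

A single flipped bit anywhere in the protected data makes the egress check fail, which is the detection behaviour the paragraph describes.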
There are certain units within the system that are common to all devices.
The two units of most interest are the Central Management Unit and the Fabric
Management Interface.

Data is handled throughout the system in fixed length cells. There are
several reasons for using fixed length cells, one of which is that the quality of
service (QoS) is easier to guarantee when the switch is reconfigured after every
switch cycle. In addition, the packet latency is improved for both long and short
packets and the buffer management is simplified. In practice, there are slight
variations in the format of the cells, due to the need to include steering information
in headers at various points. Figure 4 shows the flow of data through the switch
fabric and the functions performed by the seven steps shown in the diagram are
detailed below:-

Firstly, packets received from a line end are, where necessary, segmented
in an ingress traffic manager ITM and formed into cells of the correct format to be
transferred over the common interface, denoted in Figure 4 as CSIX;

Secondly, at the ingress router SRI, arriving cells are examined and placed
in the appropriate queue. There are several sets of queues, shown here as unicast
queues UQ, multicast queues MQ and broadcast queues BQ. In the diagram the
cell has been placed into one of the unicast queues;

Thirdly, the arrival of a cell triggers a 'request for transfer' RFT to the
controller SM. The cell will be held in the queue until this request is granted;

At step 4, the controller SM executes an arbitration process and determines
the maximal connection set that can be established within the switching matrix
SCM for the next switching cycle. It then grants the 'request to transfer' RTT and
signals the egress router SRE that it must expect a cell.

At the fifth step, the ingress router SRI, having been granted a connection,
also executes an arbitration process to determine which cell will be transferred.
The cell is transferred through the memory-less switching matrix SCM and into a
buffer in the egress router SRE.

As shown at step 6, there is one egress buffer per egress traffic manager
ETM and arriving cells are examined and placed in the appropriate traffic manager
queue in the egress router SRE.

Finally, at step 7, the cell is transferred to the egress traffic manager ETM
over the standard interface CSIX and, where necessary, re-assembled into a packet
before onward transmission.
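
The split between the controller's arbitration (step 4) and the ingress router's own arbitration (step 5) can be sketched as two small functions. This is a minimal sketch under stated assumptions: the greedy matching and the longest-queue rule are placeholders chosen for brevity, not the arbitration processes of the invention, and all names and data structures are hypothetical.

```python
from collections import deque

def controller_arbitrate(requests):
    """Step 4 (controller): connect each ingress router to at most one egress
    router and vice versa. `requests` is a set of (ingress, egress) pairs.
    Greedy placeholder, not the arbitration process of the source."""
    used_in, used_out, grants = set(), set(), {}
    for ingress, egress in sorted(requests):
        if ingress not in used_in and egress not in used_out:
            grants[ingress] = egress
            used_in.add(ingress)
            used_out.add(egress)
    return grants

def router_arbitrate(queues, egress):
    """Step 5 (ingress router): choose which of its own queues for the granted
    egress router to serve. `queues` maps (egress, traffic_manager) -> deque.
    Picks the longest non-empty queue (assumed rule) and returns its head cell."""
    candidates = [(len(q), key) for key, q in queues.items() if key[0] == egress and q]
    if not candidates:
        return None
    _, key = max(candidates)
    return queues[key].popleft()

# Two ingress routers (0 and 1) both hold cells for egress router 1.
queues = {
    0: {(1, "tm_a"): deque(["x1"]), (1, "tm_b"): deque(["y1", "y2"])},
    1: {(1, "tm_a"): deque(["z1"])},
}
requests = {(i, key[0]) for i, qs in queues.items() for key, q in qs.items() if q}
grants = controller_arbitrate(requests)
sent = {i: router_arbitrate(queues[i], e) for i, e in grants.items()}
print(grants, sent)
```

The controller only decides which router pair is connected for the cycle; the choice of the actual cell is left to the granted ingress router, which is the division of labour the steps describe.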

The transfer of data through the system is packaged in cells termed tensors.
An arbitration cycle transfers one tensor per router through the switching matrix
SCM. Each tensor consists of 6 or 8 vectors. A vector consists of one byte per
plane of the switching matrix and is transferred through it in one system clock
cycle. The sizes of the vector and tensor for a particular application are determined
by the bandwidth required in the fabric and the most appropriate cell size. The
following sections show the typical packaging of the data as it flows through the
system for ATM and Ethernet.
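
The relationship between planes, vectors and tensors reduces to simple arithmetic, sketched below. The figure of 10 planes is an assumption inferred from the 10-byte vectors quoted for the ATM and Ethernet cases; the function names are illustrative only.

```python
def tensor_bytes(vectors_per_tensor: int, planes: int) -> int:
    """A vector carries one byte per matrix plane, so a tensor holds
    vectors_per_tensor * planes bytes."""
    return vectors_per_tensor * planes

def tensor_transfer_cycles(vectors_per_tensor: int) -> int:
    """Each vector crosses the matrix in one system clock cycle."""
    return vectors_per_tensor

PLANES = 10  # assumed: one byte per plane gives the 10-byte vectors in the text

for vectors in (6, 8):
    print(vectors, "vectors:", tensor_bytes(vectors, PLANES), "bytes in",
          tensor_transfer_cycles(vectors), "clock cycles")
```

Under that assumption a 6-vector tensor carries 60 bytes and an 8-vector tensor carries 80 bytes.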



As shown in Figure 5a, illustrating the ATM application, payload cells
P containing fifty three bytes of data arriving from an ingress traffic manager ITM
across the interface CSIX are re-packaged into 60-byte tensors (6 vectors of 10
bytes). The ingress router analyses the CSIX header UH and wraps the CSIX
packet with the core header CH to create a 60-byte tensor UCT in an ingress queue.
When the controller SM grants the required connection the tensor passes through
the switching matrix SCM in one switch cycle to the egress router which writes the
unicast tensor UT into the egress queue indicated in the core header. When the
tensor reaches the head of the egress queue the core header is stripped off and the
remaining CSIX packet is sent to the egress traffic manager.
If the CSIX frame type indicates a multicast packet MI as shown in Figure
5b, the ingress router strips out the multicast mask MM and replicates the packet
into the indicated queues, modifying the target field for each copy as
appropriate. The flow then proceeds as for unicast, except that the tensor is written
simultaneously into multiple egress buffers after passing through the switching
matrix.

In the case of Ethernet or variable length packets as shown in Figure 6, an
ingress traffic manager ITM using segmentation and reassembly functionality
(SAR) converts the variable length packets VLP into CSIX packets at ingress,
embedding the SAR header in the payload. CSIX packets are then transported
through the system in the same way as for the ATM example of Figure 5, except
that the tensor size is set to 80 bytes (8 vectors of 10 bytes) allowing up to 70 bytes
of Ethernet frame to be carried in a single segmented packet. Note that the
segmentation header is considered as private to the traffic managers and is shown
for illustrative purposes only. The system treats it transparently as part of the
payload. The CSIX interface description allows for truncated packets, that is, if a
traffic manager sends a payload that would not fill a tensor it can send a shortened
CSIX packet. The ingress router stores the short packet in the ingress queues (on
fixed tensor boundaries). Any part of the tensor queue that is not used is filled with
INVALID bytes. The fixed size tensors will then have the INVALID bytes
discarded at the egress router.

In the system architecture, the scheduling and arbitration arrangement is
distributed and occurs at two points; in the controller SM (between switch ports
and between priorities) and in the router (between traffic managers) SRS/A. Figure
7 is a conceptual diagram showing only the scheduling/arbitration functions across
the data switch from the traffic managers TM through the common interface CSIX
to the routers SR and the controller SCM into the switch. The
diagram also shows how information on channel link bandwidth allocation and
switch efficiency, queue status, backpressure and traffic congestion management
is handled by the referenced arrows.

The controller SM provides the overall control function of the system.
When the routers request connections from the controller, they identify their
requested switching matrix connection by switch port and priority. The controller
then selects combinations of connections in the switching matrix to make best use
of the matrix connectivity and to provide fair service to the routers. This is
accomplished by using an arbitration mechanism. The controller SM can also
enforce pseudo-static bandwidth allocation across the priorities and ingress/egress
switch port combinations. For example, an external system controller can
guarantee a proportion of the available bandwidth to each of the priorities and to
specific connections. Unused allocations will be fairly shared between other
priorities and connections.
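
One way to picture guaranteed allocations with fair sharing of the unused part is the single-pass sketch below. The text does not say how "fairly shared" is computed; the equal split among priorities that still have demand, the integer slot counts and every name used here are assumptions for illustration only.

```python
def share_bandwidth(guaranteed, demand):
    """Grant each priority min(guarantee, demand), then split the unused
    guaranteed slots equally among priorities that still want more.
    Single pass; equal split is an assumed reading of 'fairly shared'."""
    granted = {p: min(guaranteed[p], demand[p]) for p in guaranteed}
    spare = sum(guaranteed.values()) - sum(granted.values())
    hungry = [p for p in guaranteed if demand[p] > granted[p]]
    if hungry:
        extra = spare // len(hungry)
        for p in hungry:
            granted[p] += min(extra, demand[p] - granted[p])
    return granted

# Hypothetical allocation table: slots per 100 switching cycles for four priorities.
guaranteed = {0: 40, 1: 30, 2: 20, 3: 10}
demand = {0: 10, 1: 50, 2: 40, 3: 10}
print(share_bandwidth(guaranteed, demand))
```

Here priority 0 uses only 10 of its 40 guaranteed slots, and the 30 unused slots are divided between priorities 1 and 2, which both have more traffic than their guarantee covers.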

The controller SM also has a 'best effort' mechanism to dynamically bias
the arbitration in favour of long queues for applications that do not require strict
bandwidth enforcement.

The routers provide an aggregation function for multiple traffic managers
into a single switch port. When the controller SM grants a connection to a
particular switch port through a particular priority, the appropriate router
must choose one of up to eight unicast and one multicast traffic manager queues
to service. This is accomplished through a weighted round-robin mechanism,
which can select a queue based on a combination of factors. Ingress queue length
may allow for favouring of long queues over shorter ones, and the urgency field in
the CSIX header allows a traffic manager to temporarily increase the weighting of
a queue. Queue bandwidth allocation is also a factor, determined by the
external system controller or by dynamic intervention. Finally, target congestion
management and traffic shaping are features taken into account. The sensitivity of
the weighting function to these parameters is determined for each priority and,
together with the bandwidth allocations, may be altered dynamically.

The system implements three levels of backpressure, described in more
detail below. These are flow, traffic management and core backpressure. Flow
level backpressure is accomplished by the traffic managers sending packets of data
through the system and hence is beyond the scope of this document. Flow level
backpressure packets will appear to the system to be no different than data packets
and as such are transparent.

So far as traffic level backpressure is concerned, the system is organised
to manage its data-flow at the traffic manager level of granularity (with four
priorities). Further granularity is achieved at the traffic manager itself. An egress
traffic manager can send backpressure information to the egress side of the router
over the CSIX interface, multiplexed with the datastream. Since the egress side of
the router has just a single queue per traffic manager, this is just a one-bit signal.
Backpressure between routers is signalled via a dedicated broadcast mechanism in
the switching controller and switching matrix. There are a number of thresholds
in the egress buffer queues. When a threshold is crossed, the egress router signals
the controller with a backpressure broadcast request. In the controller, such a
request stalls the arbiter at the end of the current cycle and the controller issues a
one vector broadcast connection to the switching matrix planes and informs the
requesting egress router. The egress router then sends one vector's worth (10 bytes)
of egress buffer status through the matrix to the ingress routers. The controller then
continues the interrupted cycles. In the event of several egress routers
simultaneously requesting a backpressure broadcast, the controller will satisfy all
the requests in a simple round-robin manner before resuming normal service. The
latency introduced in the backpressure mechanism due to this contention does not
affect the egress buffering since during this period a router will only be receiving
backpressure data from other routers, which does not need to be queued.
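
The round-robin servicing of simultaneous backpressure broadcast requests can be sketched as follows. This is a minimal illustration, not the controller's actual logic: the function name, the use of a "last served" pointer and the default of 16 routers are assumptions.

```python
def service_backpressure_requests(pending, last_served, num_routers=16):
    """Return the order in which pending egress routers get their broadcast.

    pending     -- set of router numbers that have requested a broadcast
    last_served -- router number that was served most recently (assumed state)
    """
    order = []
    # Walk once around the ring, starting just after the last router served.
    for offset in range(1, num_routers + 1):
        router = (last_served + offset) % num_routers
        if router in pending:
            order.append(router)
    return order


if __name__ == "__main__":
    # Routers 2, 9 and 14 all request a broadcast in the same cycle.
    print(service_backpressure_requests({2, 9, 14}, last_served=9))
```

All pending requests are served in one pass around the ring before normal arbitration resumes, which matches the description that the controller satisfies every request before resuming normal service.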



An egress router will aggregate the threshold transitions from all its egress
queues which have occurred during a switch cycle into one backpressure broadcast,
so that the maximum number of backpressure broadcasts between two tensors is
limited to the number of routers. When an ingress router receives a
backpressure broadcast vector of the form shown in Figure 8, it uses it to update
the ingress queue weightings as appropriate.

Two modes of backpressure signalling between egress and ingress routers
are supported, namely start/stop and multi-state signalling. Multi-state signalling
allows the egress router to signal the multi-bit state of all its queues (1 byte per
queue). This multi-state backpressure signalling coupled with weighted-round-robin
scheduling in the ingress routers minimises the probability of egress queues
being full, which is significant when attempting to forward multicast or broadcast
traffic in a heavily utilised switch.



The ingress router signals stop/start backpressure to the ingress traffic
managers via the CSIX interface. This provides a 16-bit backpressure signal to
allow the ingress router to identify the ingress queue to which the signal relates.
Egress queue thresholds are set globally, whilst ingress queue thresholds are set
per queue.

The controller does not keep track of the state of the egress router buffers.
However, core level backpressure is in place across the router/controller interface
to prevent egress buffer overflows, by preventing the controller from scheduling
any traffic to a particular egress router in the event that all of the respective
router's buffers are full.

Multicast in this system chipset is implemented through the optimal
replication of tensors at ingress and egress. An ingress router has one multicast
queue per egress router per priority. An ingress multicast tensor (see Figure 8) is
created in each of the appropriate queues with the egress multicast masks in the
target fields TM of the core headers. Each tensor, of which three are shown, has a
length equal to 6 or 8 vectors and a width of 10 bytes. A backpressure vector BPV
may be inserted between adjacent tensors as shown. The multicast tensors are then
forwarded through the core in the same way as for unicast and the egress routers
then replicate the tensors into the required egress buffers in parallel. This multicast
mechanism is intended to provide optimum switch performance with a mix of
unicast and multicast traffic. In particular, it maintains the efficiency and fairness
of the scheduling and arbitration, allowing the switch to provide consistent quality
of service.

The system provides a loss-less fabric, therefore multicast tensors cannot be
forwarded through the switching matrix unless all the destination queues are not
full. In a heavily utilised switch, if only stop/start backpressure from the egress
queues was implemented, this could severely restrict the useable bandwidth for
multicast traffic. Two mechanisms are included in the system to improve its
multicast performance. These are: 1. multi-state backpressure from the egress
router, which reduces the probability of egress queues being full, and 2. increasing
the weighting of the multicast ingress queues in the weighted-round-robin
scheduler when they have been blocked, to increase their chances of being
scheduled when the block clears. To avoid multicast (and broadcast) being blocked
by off-line egress ports, the backpressure signals can be individually masked out
by an external system controller via the Fabric Management Interface (FMI).



The requirement for wire-speed broadcast (benchmarking) is met by having
a single on-chip broadcast queue in each egress router. When the controller
schedules a broadcast connection, the tensor will be routed in the switching matrix
to all routers in parallel, thus avoiding any ingress congestion (no tensor replication
at ingress). Broadcast backpressure is provided by having each router inform the
controller when it transitions in to or out of the state "all-egress-buffers-not-full".
The controller will only schedule a broadcast when all egress buffers in all routers
are not full. Broadcast backpressure is a configurable option. If it is not activated,
the routers do not send status messages and the controller schedules broadcasts on
demand. Using this method there is no guarantee that the packet will be forwarded
on all ports.
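
The broadcast backpressure rule can be summarised in a few lines. The sketch below is illustrative only; the class and method names are hypothetical, and it models just the decision of whether a broadcast may be scheduled, not the arbitration itself.

```python
class BroadcastGate:
    """Tracks the 'all-egress-buffers-not-full' state reported by each router."""

    def __init__(self, num_routers, backpressure_enabled=True):
        self.enabled = backpressure_enabled
        # Assume every router starts with room in all of its egress buffers.
        self.not_full = [True] * num_routers

    def status_update(self, router, all_egress_buffers_not_full):
        # Called when a router reports a transition in to or out of the state.
        self.not_full[router] = all_egress_buffers_not_full

    def may_schedule_broadcast(self):
        # With the option disabled, broadcasts are scheduled on demand and
        # delivery on every port is not guaranteed.
        if not self.enabled:
            return True
        return all(self.not_full)


if __name__ == "__main__":
    gate = BroadcastGate(num_routers=4)
    gate.status_update(2, False)          # router 2 has a full egress buffer
    print(gate.may_schedule_broadcast())  # broadcast is held back
```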

The switching matrix is shown in schematic form in Figure 9. It comprises
a high-speed, edge-clocked, synchronous, 16 port dual plane serial cross-point
switch SCN for use in the system. It has been optimised to provide a scaleable,
high bandwidth, low latency data movement capability. It operates under the
control of the controller SM, which sends configuration information to the matrix
over the controller interface SMI to create connections for the transmission of data
between routers. The buffer and decode logic BDL receives this information and
uses it to control the interconnections within the matrix. Data is applied in serial

field to connect ingress and egress ports. For 4 and 8 port configurations, the
number of bits of the control port field required is 2 and 3 respectively.
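
The field widths quoted here are what a binary-encoded port number would need, and the same rule gives four bits for a 16 port matrix. The helper below is a small sketch of that arithmetic; the function name is hypothetical.

```python
def port_field_bits(num_ports):
    """Bits needed to encode a port number in the range 0 .. num_ports - 1."""
    if num_ports < 2:
        raise ValueError("at least two ports are needed")
    return (num_ports - 1).bit_length()


if __name__ == "__main__":
    for ports in (4, 8, 16):
        print(ports, "ports ->", port_field_bits(ports), "bits")
```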

In operation, the switching matrix receives configuration information from
the controller SM via the controller interface SMI. This information is loaded into,
and stored in, configuration registers. Routing information is passed in the form of
a number of encoded fields determining which input port is to be connected to each
output port via the switching matrix. In a 16 x 16 matrix, there are 16 output ports.
For each output port there is a four bit source address which is encoded to define
which input port is to be connected to an output port. There is also an enable signal
for each field to signal that the field is valid and a configure signal that indicates
that the whole interface is valid. If a field is signalled as not valid, the output port
for that field is not connected. If the configure signal is not asserted, the matrix
does not change its current configuration. The configuration information on the
controller/matrix interface is loaded into the device when the configure signal is
asserted. A 16-stage programmable pipeline is used to delay the configuration
information until it is required for switching the matrix. If there is a parity error on
a port then that port's enable signal will be set to zero and a null tensor will be
transmitted to the output of that port. The register that holds the parity error may
only be loaded when the configure signal is high and is cleared when read by the
diagnostic unit. A parity check is also carried out on the configure signal. If a
parity error occurs here then a parity fail condition is asserted, all port enable
signals are set to zero and all the output ports on the device will transmit null
tensors. The connection between the routers and the matrix is via a set of serial data
streams, each running at one Gbaud. Once a connection across the matrix has been
set up, tensors are transmitted between ingress and egress routers. The whole
process exhibits low latency due to a very small insertion delay. Multiple switching

There is also another mechanism in the controller/router interface referred
to as 'core level backpressure', which prevents the controller from scheduling any
traffic to a particular egress router. A router uses core level backpressure when all
its egress buffers are full.

The controller is capable of establishing both unicast and broadcast
connections in the switching matrix. It is also capable of dealing with system
configurations that contain a mixture of 'full' and 'half speed' ports, for example
a mixture of 10Gbit/sec and 5Gbit/sec routers.

Figure 12 shows a router device. This is a system port interface control
device. Its main function is to support user applications' data movement
requirements by providing access into and out of the system. There are two
instances of the ingress interface unit IIU, one for each of the traffic managers that
can be connected to a system port. The IIU is responsible for transferring data from
a traffic manager into an internal FIFO queue on the router and informing the ICU
that it has tensors ready to transmit into the system. The external interface to the
traffic manager uses the common system interface CSIX. This defines an n x 8-bit
data bus; the ingress interface units IIU operate in a 32-bit mode. The FIFO is four
tensors deep to allow one tensor to be transferred to the ICU while subsequent ones
are being received.

To generate the tensors, the ingress interface unit appends a three byte
system core header to the CSIX frame prior to passing it, indicating the tensor's
availability to the ICU. The IIU examines the CSIX header to determine whether
the frame is of type unicast, multicast or broadcast and indicates the type to the
ICU. If the frame is unicast, the IIU sets a single bit in byte 1 indicating the
destination TM; this is derived from the destination address in the CSIX header. If
the frame is multicast, a tensor is constructed and sent for each of the 16 system
ports that have a non-zero CSIX mask. In the case of a broadcast CSIX frame, byte
1 is set to all 1s by the IIU. The IIU is also responsible for calculating the two byte
Tensor Error Check which utilises a cyclic redundancy check.
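
The header construction can be sketched as below. Several details are assumptions rather than facts from the description: that the three byte header is laid out as the mask byte followed by the two byte Tensor Error Check, that the mask is eight bits wide with traffic managers numbered 0 to 7, that the check covers the mask byte and the frame, and that the CRC is the CRC-CCITT variant provided by Python's `binascii.crc_hqx`. The function names are hypothetical.

```python
import binascii


def tm_mask(frame_type, dest_tm=0, multicast_mask=0):
    """Build the traffic manager mask byte (byte 1 of the core header)."""
    if frame_type == "unicast":
        return 1 << dest_tm            # single bit for the destination TM
    if frame_type == "multicast":
        return multicast_mask & 0xFF   # per-port mask taken from the CSIX frame
    if frame_type == "broadcast":
        return 0xFF                    # all 1s
    raise ValueError(f"unknown frame type: {frame_type}")


def core_header(frame_type, frame, dest_tm=0, multicast_mask=0):
    """Return an assumed 3-byte header: mask byte + 2-byte error check."""
    mask = tm_mask(frame_type, dest_tm, multicast_mask)
    tec = binascii.crc_hqx(bytes([mask]) + frame, 0xFFFF)  # assumed CRC variant
    return bytes([mask]) + tec.to_bytes(2, "big")


def tec_ok(header, frame):
    """Egress-side check: recompute the CRC and compare with the header."""
    expected = binascii.crc_hqx(header[:1] + frame, 0xFFFF)
    return int.from_bytes(header[1:3], "big") == expected


if __name__ == "__main__":
    header = core_header("unicast", b"example CSIX frame", dest_tm=3)
    print(header.hex(), tec_ok(header, b"example CSIX frame"))
```

The same mask byte is what an egress router would inspect to decide which traffic managers receive a copy of the tensor.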

Traffic manager flow control is provided by making each ingress interface
unit IIU responsible for signalling traffic manager start/stop backpressure
information to its associated egress interface unit EIU. The IIU obtains this
backpressure information by decoding the CSIX control bus. If parity error
checking has been enabled (the appropriate bit in a status register is set) and the IIU
detects a parity error on CSIX then an error is logged and the corresponding tensor
discarded. This log can be retrieved via the FMI.

The ingress control unit ICU is responsible for accepting tensors from the
ingress interface units IIUs, making connection requests to the controller interface
unit SMIU, storing tensors until the controller grants a connection and then sending
tensors to transceivers TXR. There are two types of connection requests (and
subsequent grants). One is used for all unicast/multicast tensors and the second is
used for broadcast traffic. For unicast/multicast tensors the control
unit/controller interface unit signalling incorporates the system destination port and
priority. Clearly for broadcast tensors there is no requirement for a system
destination address and, since there is only one level of broadcast tensors, a priority
identifier is also not required.

Ingress buffering is illustrated in Figure 13. This buffering for unicast
queuing UQ is implemented such that there is one for each possible destination
traffic manager and priority. In addition to unicast queues, there is a multicast
queue MQ per port per priority and a single broadcast queue BQ. The queues are
statically allocated. There are 512 unicast, 64 multicast and one broadcast queue.
The unicast and multicast queues are located in external SRAM. The queue
organisation allows flow control down to OC-12 granularity. Within the unicast
address field of the CSIX header, 3 bits are allocated for the number of traffic
managers a router can support. Since the router supports two traffic managers, the
spare bit field is used for a function known as Service Channel. Service Channels
provide the means of fully exploiting the router's implicit OC-12 granularity
features.
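
The queue counts quoted above are consistent with 16 system ports, four priorities and a 3-bit traffic manager field giving eight unicast queues per port and priority. The short calculation below makes that arithmetic explicit; the function and parameter names are hypothetical, and the reading of the 3-bit field as one traffic manager bit plus two spare Service Channel bits is an inference from the text, not a stated fact.

```python
def queue_counts(ports=16, priorities=4, tm_field_bits=3):
    """Return (unicast, multicast, broadcast) queue counts per ingress router."""
    queues_per_port_and_priority = 2 ** tm_field_bits  # 8 for a 3-bit field
    unicast = ports * priorities * queues_per_port_and_priority
    multicast = ports * priorities                     # one per port per priority
    broadcast = 1
    return unicast, multicast, broadcast


if __name__ == "__main__":
    print(queue_counts())
```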


When the ingress control unit ICU receives a connection grant signal from
the controller interface unit SMIU (which specifies egress port and priority), the
ICU must choose one of up to 8 qualifying unicast queues or the multicast queue
from which to forward a tensor. This is achieved using a weighted round-robin
mechanism that takes into account several parameters. One is the ingress queue
length, which allows for the favouring of longer queues over shorter ones, and
another is aggregate queue tensor urgency, which allows a traffic manager to
temporarily increase the weighting of a queue via the urgency field in the CSIX
header. One further parameter taken into account is queue bandwidth allocation,
whereby an external system controller or system operator can configure the system
to provide bandwidth allocation to individual flows via the FMI. The final
parameter considered is that of target egress queue backpressure. This is needed
because the effective performance of the multicast scheme requires that the
probability of queues being full be minimised. The sensitivity of the weighting
function to the input variables is controlled by four sets of global sensitivity
variables (one per priority). These settings are configured at system initialisation.
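
One way to picture the weighted round-robin choice is sketched below. It is a hedged illustration, not the ICU's algorithm: the linear weighting formula, the field names, the rule that a "very full" target queue makes a queue ineligible, and the credit-based selection (a smooth weighted round-robin) are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    length: int = 0          # tensors waiting in this ingress queue
    urgency: int = 0         # aggregate urgency from the CSIX headers
    allocation: float = 1.0  # configured bandwidth share
    egress_state: int = 0    # 0 = fairly empty ... 3 = very full
    credit: float = 0.0      # running credit used by the scheduler


@dataclass
class Sensitivity:           # one such set would exist per priority
    length: float = 1.0
    urgency: float = 1.0
    allocation: float = 1.0
    backpressure: float = 1.0


def weight(q, s):
    """Assumed linear weighting; empty or blocked queues get weight zero."""
    if q.length == 0 or q.egress_state >= 3:
        return 0.0
    w = (s.length * q.length + s.urgency * q.urgency
         + s.allocation * q.allocation - s.backpressure * q.egress_state)
    return max(w, 0.0)


def pick(queues, s):
    """Choose the next queue to serve, or None if nothing is eligible."""
    weights = [weight(q, s) for q in queues]
    total = sum(weights)
    if total == 0:
        return None
    for q, w in zip(queues, weights):
        q.credit += w
    eligible = [q for q, w in zip(queues, weights) if w > 0]
    best = max(eligible, key=lambda q: q.credit)
    best.credit -= total     # serving a queue costs it one full round of credit
    best.length -= 1
    return best


if __name__ == "__main__":
    qs = [Queue("long", length=6), Queue("short", length=2)]
    only_length = Sensitivity(length=1.0, urgency=0.0, allocation=0.0, backpressure=0.0)
    print([pick(qs, only_length).name for _ in range(4)])
```

Raising a sensitivity value makes the corresponding parameter dominate the choice, which is the role the description gives to the per-priority sensitivity settings.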

To provide an ingress flow control mechanism, the ingress control unit ICU
implements three watermark levels to indicate the state of the queues (fairly empty,
filling up, fairly full or very full). The watermarks have associated hysteresis and
both values are configurable via the FMI. When a queue moves from one state to
another, the ICU signals the change to each of the egress interface units EIUs. In
addition to this 'multistate' backpressure mechanism it is also possible to invoke
a second mode of backpressure signalling that involves only start/stop signalling.
The backpressure mechanism mode is selected via the FMI by setting the
watermark levels appropriately.
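
Three watermarks with hysteresis give four queue states, as the sketch below illustrates. The class name, the way hysteresis is applied (a state is only left on the way down once the fill level drops a fixed margin below its watermark) and the numeric values are assumptions for illustration. Setting all three watermarks to the same level collapses the scheme to two states, which is one plausible reading of how start/stop mode could be selected through the watermark settings.

```python
STATES = ["fairly empty", "filling up", "fairly full", "very full"]


class WatermarkTracker:
    def __init__(self, watermarks, hysteresis):
        self.watermarks = sorted(watermarks)  # three rising thresholds
        self.hysteresis = hysteresis
        self.state = 0                        # index into STATES

    def update(self, fill):
        """Return the new state name if the state changed, otherwise None."""
        new = self.state
        # Move up while the fill level has reached the next watermark.
        while new < 3 and fill >= self.watermarks[new]:
            new += 1
        # Move down only once the level is clearly below the watermark.
        while new > 0 and fill < self.watermarks[new - 1] - self.hysteresis:
            new -= 1
        if new == self.state:
            return None
        self.state = new
        return STATES[new]


if __name__ == "__main__":
    tracker = WatermarkTracker([10, 20, 30], hysteresis=2)
    for level in (12, 9, 7, 35):
        print(level, tracker.update(level))
```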

The egress control unit ECU signals egress backpressure information to the
ICU. This information relates either to the signalling egress router's buffers or to
information the ECU has received about the state of another egress router's buffers.
If the information relates to the signalling egress router's buffers the ICU updates
the backpressure status used by the ingress scheduling algorithm and makes a
request to send backpressure information to the controller interface unit SMIU. If
it relates to another egress router's buffers then the ICU simply updates its own
backpressure status.

There are two instances of the egress interface unit EIU, one for each of the
traffic managers that are connected to a system port. The egress interface unit is
responsible for accepting tensors from the ECU and transmitting them as frames
over CSIX to the associated traffic manager.

To provide traffic manager flow control, the egress interface unit EIU
accepts traffic manager start/stop backpressure information from its associated
ingress interface unit IIU (that is, the one connected to the same traffic manager).
If the EIU is currently sending a frame to the traffic manager, then it continues the
transfer of the current frame and then waits until a start indication is received
before transferring any subsequent frames.

To provide ingress flow control the EIU accepts ingress buffer multistate
backpressure information from the ICU and sends it immediately to the traffic
manager.



The egress control unit ECU is responsible for accepting tensors from the
serial transceivers, when informed by the controller interface unit SMIU of their
imminent arrival, and forwarding them to the relevant EIU. The ECU examines the
traffic manager mask byte of the system core header to determine the correct
destination EIU. In the case of multicast (or broadcast) tensors, multiple bits are
set in the mask and the tensors are simultaneously transferred to all the EIUs for
which a corresponding bit is set. This feature provides wire speed multicasting at
the egress router. The ECU is responsible for checking the tensor error check bytes
of the system core header. If the system core error checking has been enabled (i.e.
the appropriate bit in a status register is set) and the ECU detects an error, then it
is logged and the corresponding tensor discarded. To provide an egress flow
control mechanism the ECU implements three watermark levels to indicate the
state of the egress buffers (fairly empty, filling up, fairly full or very full). When
an egress buffer moves from one state to another the ECU signals the change to the
ICU. The level of the watermarks is configurable via the FMI. In addition to this
multistate backpressure mechanism it is also possible to invoke a second mode of
backpressure signalling that involves only start/stop signalling. The type of
backpressure mechanism is selected via the FMI by setting the watermark levels
appropriately.

The controller interface unit SMIU is responsible for controlling the
interface to the controller. Since the controller operates at the system port rather
than the traffic manager port level of granularity, the SMIU also operates at this
level. The SMIU maintains a count of the number of tensors in the ingress queues
associated with each destination system port. The count is incremented each time
the SMIU is informed of a tensor arrival by the ICU and decremented each time the
SMIU receives a grant from the controller.

The controller interface unit SMIU contains a state machine that is tightly
coupled to a corresponding one in the controller. For small numbers of tensors
(less than about six or seven), the SMIU notifies the controller of each new tensor
arrival. For larger numbers of tensors, the SMIU only informs the controller when
the count value crosses predefined boundaries.
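
The notification behaviour for one destination system port can be sketched as a counter that reports every arrival while the count is small and only boundary crossings once it is large. The class name, the exact small-count limit of six and the boundary values are hypothetical; the description gives only "about six or seven" and "predefined boundaries".

```python
from bisect import bisect_right


class RequestCounter:
    """Per-destination tensor count kept by a (hypothetical) SMIU model."""

    def __init__(self, exact_limit=6, boundaries=(8, 16, 32, 64)):
        self.exact_limit = exact_limit
        self.boundaries = tuple(sorted(boundaries))
        self.count = 0

    def _band(self, count):
        # Small counts each form their own band, so every arrival is reported.
        if count <= self.exact_limit:
            return count
        # Larger counts share a band until the next boundary is crossed.
        return self.exact_limit + 1 + bisect_right(self.boundaries, count)

    def tensor_arrived(self):
        """Increment the count; return True if the controller should be told."""
        old_band = self._band(self.count)
        self.count += 1
        return self._band(self.count) != old_band

    def grant_received(self):
        """A grant from the controller removes one tensor from the count."""
        if self.count > 0:
            self.count -= 1


if __name__ == "__main__":
    counter = RequestCounter()
    print([counter.tensor_arrived() for _ in range(10)])
```

The intent of such a scheme is to keep the controller's view accurate when queues are short while limiting signalling traffic when they are long.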

The central management unit is common to all devices. Its functions are 
to provide a FMI between each device and an external controller, control error 
management within the device and provide a reset interface and reference clocking 
in to each device. 



Referring back to Figure 12, the routers provide access into the system via
CSIX ingress and egress interfaces. On receiving a CSIX packet from the ingress
traffic manager, over the CSIX interface ICSIX, the ingress interface unit IIU
checks the type and validity of the packet. The packet is then wrapped with a core
header, the contents of which vary with the packet type. When the core header has
been appended, the packet becomes known as a tensor. The ingress control unit
ICU makes a request to the controller through the controller interface SMI for a
connection across the switching matrix and stores the tensors until the connection
is created. In order to eliminate head of line blocking for unicast traffic, ingress
buffering is organised into separate queues, one for each possible destination traffic
manager TMQ1 to TMQN and priority P1 to P4 as shown in Fig. 13. Individual
queues per priority are not required to avoid head of line blocking but are
advantageous as they allow the controller to enforce bandwidth allocation to each
priority in the switch. In addition to the unicast queue there is a multicast queue
per port per priority and a single broadcast queue. The unicast and multicast
queues are statically allocated in external SRAM. The purpose of this level of
buffering is to allow the controller to allocate connections efficiently by giving it
a view of the ingress datastreams and to provide rate matching between the router
external interfaces and the router/matrix interface.

When connections are granted, the controller creates a connection across
the switching matrix to the requested egress router at a given priority. The ingress
control unit ICU must now choose one of the qualifying unicast or multicast queues
from which to forward a tensor to the transceiver for serialisation. This level of
router scheduling is done on a weighted-round-robin basis. Each unicast and
multicast queue has a weighting associated with it, which is determined by the
backpressure from the egress buffers, the queue length, the queue urgency and the
static bandwidth allocation. On the egress side the controller informs the router
of a tensor's imminent arrival. The egress control unit ECU receives this tensor and
examines the core header to see which traffic manager to send the tensor to.
Tensors are then assembled back into datastreams and forwarded via CSIX to the
appropriate traffic manager.

Multicasting in the system is achieved by the optimal replication of tensors 
at the ingress and egress. On the ingress side a router has one multicast queue per 
egress router at each priority. Multicast routing information is appended on the 
ingress side and on arrival at the egress side these masks determine the replication 
of tensors into the required egress buffers. Broadcast in the system is achieved by 
having a single on chip broadcast queue at the ingress of each router. When the 
controller schedules a broadcast connection, the tensor will be routed by the matrix 
to all egress routers in parallel, thus avoiding any ingress congestion.

Each system device contains a logic block known as the fabric management
interface unit (FMIU). The FMIU interfaces to the functional logic, also known as
the core, within the device in order to provide run-time (read/write) access to a
chosen subset of the registers and RAM locations, a mechanism to report run-time
fail conditions detected by the device, and scan access (read/write) to the total set
of registers in the functional logic while the functional logic is not operational.

The external interface to the fabric management interface unit FMIU
requires a number of inputs, including a Hard Reset input which sets the system
device into a known state. In particular, it sets the device into a state where the
FMIU is fully functional and the serial interface can be used. Hard Reset is
expected to be applied when power is first applied to the device, and may also be
applied at other times. The external interface also has serial input and serial output
lines and a device locator address field used to identify a particular instance of a
device. The device locator field is generated by tie-offs that are determined by the
device's physical position in the system.

The main functions of the central management unit (CMU) shown in
Figure 12 include error detection and logging logic. This is responsible for
detecting error conditions and states within the chip or on its interfaces. As such,
its functionality is spread throughout the design and is not concentrated within a
specific block. Errors are reported and stored in the Error and Status registers and
logs, which are accessible across the FMI. The CMU also has reset and clock
generation logic responsible for the generation and distribution of clocks and reset
signals within the device. In addition, the CMU contains test control logic which
controls the mechanisms built in for chip test. The target fault coverage is 99^%.
This logic is not used under normal operating conditions. The final function of the
CMU is to provide fabric management logic common to all of the system devices

Claims



1. A method of transmitting information through a data switching apparatus connected
to a plurality of input line end devices (ILE00, ILE01, ILE0m) and output line end devices
(ELE00, ELE01, ELE0m), said input line end devices (ILE00, ILE01, ILE0m) transmitting
protocol information packets to the data switch for transmission to specific output line
end devices (ELE00, ELE01, ELE0m),

the data switching apparatus comprising a plurality of input traffic manager units
(ITM0, ITM1, ITMn), a plurality of output traffic manager units (ETM0, ETM1, ...ETMn) and
a data switch (SW), the data switch (SW) comprising a plurality of input routers (SRI0,
SRI1, ...SRIp), a plurality of output routers (SRE0, SRE1, ...SREp), and a memory-less
cyclic switch fabric (SCM), and a switch controller (SM), said switch fabric being
controlled by said switch controller (SM), said input traffic manager units (ITM0, ITM1,
ITMn) being connected to one or more of said input line end devices (ILE00, ILE01, ILE0m),
and said output traffic manager units (ETM0, ETM1, ...ETMn) being connected to one or
more of said output line end devices (ELE00, ELE01, ELE0m),

each input traffic manager unit (ITM0, ITM1, ITMn) being arranged to convert the
protocol information packets it receives from the respective input line end devices
(ILE00, ILE01, ILE0m) into fixed length cells having a header (UH), said header (UH)
indicating the output traffic manager unit (ETM0, ETM1, ...ETMn) connected to the output
line end device (ELE00, ELE01, ELE0m) to which the cell should be sent,

each input router (SRI0, SRI1, ...SRIp) being arranged to receive cells from a
respective group of said input traffic manager units (ITM0, ITM1, ...ITMn), and to maintain
virtual output queues for each output traffic manager unit (ETM0, ETM1, ...ETMn),

each output router (SRE0, SRE1, ...SREp) being arranged to transmit cells to a
respective group of said output traffic manager units (ETM0, ETM1, ...ETMn);

the method comprising, on the arrival of a cell from an input traffic manager unit
(ITM0, ITM1, ITMn), the input router (SRI0, SRI1, ...SRIp) examining the cell header (UH),
placing it in a virtual output queue for the output traffic manager unit (ETM0, ETM1,
...ETMn) indicated by the cell header (UH), generating a transfer request (RFT) including
the address of the output traffic manager unit (ETM0, ETM1, ...ETMn) indicated by the
header (UH) of that cell, and passing said request (RFT) to the switch controller (SM),

characterized in that: 

said cell headers (UH) include message priority information, and said transfer
requests (RFT) include a priority code;

the switch fabric (SCM) is controlled by the switch controller (SM) to connect ones
of said input routers (SRI0, SRI1, ...SRIp) to ones of said output routers (SRE0, SRE1,
...SREp);

the switch controller (SM) schedules the passage of the cells across the switch fabric
(SCM) at each switch cycle, by using a first arbitration process to select which of said
input routers (SRI0, SRI1, ...SRIp) to connect to which of said output routers (SRE0,
SRE1, ...SREp), and controls the switch fabric to connect the selected input routers
(SRI0, SRI1, ...SRIp) to the corresponding selected output routers (SRE0, SRE1, ...SREp);
and

upon it being determined that a given input router (SRI0, SRI1, ...SRIp) is to be
connected to a given output router (SRE0, SRE1, ...SREp):

that given input router (SRI0, SRI1, ...SRIp) performs a second arbitration process to
select a single virtual output queue, from among the virtual output queues for the output
traffic manager units (ETM0, ETM1, ...ETMn) to which the given output router (SRE0,
SRE1, ...SREp) sends cells, and transmits the cell at the head of the selected virtual
output queue across the switch fabric (SCM) to the given output router (SRE0, SRE1,
...SREp), and

the given output router (SRE0, SRE1, ...SREp) transmits the cell to the output traffic
manager unit (ETM0, ETM1, ...ETMn) indicated by the cell header (UH).

2. A method according to claim 1 in which each input router (SRI0, SRI1, ...SRIp)
maintains a virtual output queue for each output traffic manager unit (ETM0,
ETM1, ...ETMn) and priority level, and upon receipt of a cell the input router (SRI0, SRI1,
...SRIp) places the cell in the virtual output queue for the priority and output traffic
manager unit (ETM0, ETM1, ...ETMn) indicated by the cell header (UH).

3. A method according to claim 1 or 2 in which each output router (SRE0, SRE1,
...SREp) maintains an output queue for each of the group of output manager units
(ETM0, ETM1, ...ETMn) to which it transmits cells.

4. A method according to any preceding claim in which each input router (SRI0, SRI1,
...SRIp) maintains an input buffer for each of the group of input traffic manager units
(ITM0, ITM1, ...ITMn) from which it receives signals.

5. A method according to any preceding claim in which said second arbitration process
performed by the given input router (SRI0, SRI1, ...SRIp) is a weighted round-robin
arbitration process based upon: the length of said output virtual queues of the given
input router (SRI0, SRI1, ...SRIp); an aggregate queue packet urgency; and a
backpressure from said output traffic manager units (ETM0, ETM1, ...ETMn).

6. A method according to any preceding claim in which the first arbitration process
selects which input routers (SRI0, SRI1, ...SRIp) and output routers (SRE0, SRE1,
...SREp) to connect, to maximise the number of said requests (RFT) which can be
satisfied.

7. A data switching apparatus for connection to a plurality of input line end devices 
(ILEoo, ILEoi, ILEom) and output line end devices (ELEqo, ELEoi, ELEom) to transmit 
protocol Information packets received from said input line end devices (ILEoo. ILEoi. 
ILEom) to specific output line end devices (ELEoo> ELEqi, ELEom). 

the data switching apparatus comprising a plurality of input traffic manager units
(ITM0, ITM1, ITMn), a plurality of output traffic manager units (ETM0, ETM1, ...ETMn) and
a data switch (SW), the data switch (SW) comprising a plurality of input routers (SRI0,
SRI1, ...SRIp), a plurality of output routers (SRE0, SRE1, ...SREp), a memory-less cyclic
switch fabric (SF), and a switch controller (SM), said switch fabric being controlled by
said switch controller, each of said input traffic manager units (ITM0, ITM1, ITMn) being
for connection to one or more of said input line end devices (ILE00, ILE01, ILE0m), and
each of said output traffic manager units (ETM0, ETM1, ...ETMn) being for connection to
one or more of said output line end devices (ELE00, ELE01, ELE0m),

each input traffic manager unit (ITM0, ITM1, ITMn) being arranged to convert the
protocol information packets it receives from the respective input line end devices
(ILE00, ILE01, ILE0m) into fixed length cells having a cell header (UH), said cell header
(UH) indicating the output traffic manager unit (ETM0, ETM1, ...ETMn) connected to the
output line end device (ELE00, ELE01, ELE0m) to which the cell should be sent,

each of the input routers (SRI0, SRI1, ...SRIp) being arranged to receive cells from a
respective group of said input traffic manager units (ITM0, ITM1, ITMn), to maintain a set
of virtual output queues for each output traffic manager unit (ETM0, ETM1, ...ETMn), and,
on the arrival of a cell from an input traffic manager unit (ITM0, ITM1, ITMn), to examine
the cell header (UH), to place it in a virtual output queue for the output traffic manager
unit (ETM0, ETM1, ...ETMn) indicated by the cell header (UH), to generate a transfer
request (RFT) including the address of the output traffic manager unit (ETM0,
ETM1, ...ETMn) indicated by the header (UH) of that cell, and to pass said request (RFT)
to the switch controller,

each output router (SRE0, SRE1, ...SREp) being connected to a respective
group of said output traffic manager units (ETM0, ETM1, ...ETMn);

characterized in that:

each output router (SRE0, SRE1, ...SREp) is arranged, upon receipt of a cell having a
header (UH) which indicates one of that group of output traffic manager units (ETM0,
ETM1, ...ETMn), to transmit the cell to that indicated output traffic manager unit (ETM0,
ETM1, ...ETMn);

said input traffic manager units (ITM0, ITM1, ITMn) are arranged to include message
priority information in said cell headers (UH), and said input routers (SRI0, SRI1, ...SRIp)
are arranged to include a priority code in said transfer requests (RFT);

the switch fabric (SCM) is arranged, under the control of the switch controller (SM),
to connect ones of said input routers (SRI0, SRI1, ...SRIp) to ones of said output routers
(SRE0, SRE1, ...SREp);

the switch controller (SM) is arranged to schedule the passage of the cells across
the switch fabric at each switch cycle, by using a first arbitration process to select which
of said input routers (SRI0, SRI1, ...SRIp) to connect to which of said output routers
(SRE0, SRE1, ...SREp), and control the switch fabric to connect the selected input
routers (SRI0, SRI1, ...SRIp) to the corresponding selected output routers (SRE0, SRE1,
...SREp); and

each input router (SRI0, SRI1, ...SRIp) is arranged, upon it being determined that that
input router (SRI0, SRI1, ...SRIp) is to be connected to a given output router (SRE0,
SRE1, ...SREp), to perform a second arbitration process to select a single virtual output
queue from among the virtual output queues for the output traffic manager units (ETM0,
ETM1, ...ETMn) to which the given output router (SRE0, SRE1, ...SREp) is connected,
and to transmit the cell at the head of the selected virtual output queue across the switch
fabric (SF) to the given output router (SRE0, SRE1, ...SREp).

8. A data switching apparatus according to claim 7 in which each input router (SRI0,
SRI1, ...SRIp) is arranged to maintain a virtual output queue for each output traffic
manager unit (ETM0, ETM1, ...ETMn) and priority level, and the input router (SRI0, SRI1,
...SRIp) is arranged to place a received cell in the virtual output queue for the priority and
output traffic manager unit (ETM0, ETM1, ...ETMn) indicated by the cell header (UH).

9. A data switching apparatus according to claim 7 or 8 in which each output router
(SRE0, SRE1, ...SREp) is arranged to maintain an output queue for each of the group of
output manager units (ETM0, ETM1, ...ETMn) to which it can send cells.

10. A data switching apparatus according to any of claims 7 to 9 in which each input
router (SRI0, SRI1, ...SRIp) is arranged to maintain an input buffer for each of the group
of input traffic manager units (ITM0, ITM1, ITMn) from which it receives signals.

11. A data switching apparatus according to any of claims 7 to 10 in which said second
arbitration process is a weighted round-robin arbitration process based upon: the length
of said output virtual queues of the given input router (SRI0, SRI1, ...SRIp); an aggregate
queue packet urgency; and a backpressure from said output traffic manager units
(ETM0, ETM1, ...ETMn).

12. A data switching apparatus according to any of claims 7 to 11 in which the first
arbitration process selects which input routers (SRI0, SRI1, ...SRIp) and output routers
(SRE0, SRE1, ...SREp) to connect, to maximise the number of said requests (RFT) which
can be satisfied.

DATA SWITCHING METHOD AND APPARATUS

This invention relates to a method of data switching which takes
application data from numerous input sources and routes it to numerous destination
outputs, and to apparatus for performing such switching.

In a generalisation of such a concept, data arriving on input ports is
routed via a non-blocking cross bar switch to output ports. For an input N to
transfer data to an output M the switch establishes a 'connection' between N and M.
The connection generally remains for the duration of the data transfer, at which
point it may be broken and the output allowed to be connected to another input.
Data is typically transferred in 'cells'.

Because there are numerous inputs competing for numerous output
ports the possibility of contention occurs. The output port can be considered to be
a resource that must be shared amongst multiple inputs. This means that a
particular input may not be able to connect to a particular output because that
output is already in use, i.e. is already connected to another port. It is also possible
that more than one input may be requesting a connection to the same output. In
either case the result is the need for the cells or data products to be queued (buffered)
until the relevant resource becomes available.

Cells can be stored in several areas in the switch: the input, the output
and centrally. Most switches use a combination of all three. It is generally
considered that output buffering provides the most efficient way for handling
traffic shaping, i.e. the profile of the release of cells from the switch. However,
output buffering places severe requirements on the actual storage device used to
create the buffer. This is because the buffer is shared amongst multiple inputs,
which means that the storage devices must be very high performance. Hence, at
very high data rates current technology limits the use of output buffers.

It is an object of the invention to provide a data switching method and
apparatus for the more efficient handling of packets of information through a data
switch.

According to a first aspect of the invention there is provided a method
of handling packets of information through a data switch comprising input traffic
controllers, ingress routers, a memoryless cyclic switch fabric, egress routers and
output traffic controllers all under the control of a switch controller and
interconnected such that each input line connected to the data switch is terminated
on a traffic controller arranged to convert the input line protocol information
packets into fixed length cells having a header defining the data switch destination
router and output traffic controller together with message priority information,
arranged such that each ingress router serves a group of traffic controllers,
characterised in that the ingress router includes a set of input buffers, one for each
input line, and a set of virtual output queue buffers, one for each output traffic
controller from the data switch, and in which the method comprises: on the arrival
of a cell from a traffic controller the ingress router examines the cell header and
places it in the appropriate virtual output queue and generates a request for transfer
message consisting of the destination traffic controller address and a message
priority code which is passed to the data switch controller; the switch controller
schedules the passage of the cells across the switch fabric by interconnecting a
specific ingress router to a specific egress router for each switch fabric cycle in
accordance with a first arbitration process; and the ingress router selects from the
appropriate virtual output queue the cell at the head of the queue for passage across
the data switch to the appropriate output traffic controller in accordance with a
second arbitration process.



According to a second aspect of the invention there is provided a data
switch for handling packets of information comprising input traffic controllers,
ingress routers, a memoryless cyclic switch fabric, egress routers and output traffic
controllers all under the control of a switch controller and interconnected such that
each input line connected to the data switch is terminated on a traffic controller
arranged to convert the input line protocol information packets into fixed length
cells having a header defining the data switch destination router and output traffic
controller together with message priority information, arranged such that each
ingress router serves a group of traffic controllers, characterised in that the ingress
router includes a set of input buffers, one for each input line, and a set of virtual
output queue buffers, one for each output traffic controller connected to the data
switch, and in which on the arrival of a cell from a traffic controller the ingress
router examines the cell header and places it in the appropriate virtual output queue
and generates a request for transfer message consisting of the destination traffic
controller address and a message priority code which is passed to the data switch
controller, the switch controller schedules the passage of the cells across the switch
fabric by interconnecting a specific ingress router to a specific egress router for
each switch fabric cycle in accordance with a first arbitration process, and the
ingress router selects from the appropriate virtual output queue the cell at the head
of the queue for passage across the data switch to the appropriate output traffic
controller in accordance with a second arbitration process.

The invention together with its various features will be more readily
understood from the following description which should be read in conjunction
with the accompanying drawings, in which:-

Fig. 1 shows a generalised concept of the prior art.

Fig. 2 shows in block diagram form one embodiment of the data switch
of the invention.

Fig. 3 shows the switch fabric of the embodiment of the invention.

Fig. 4 shows the flow of data through the switch fabric.

Fig. 5 shows ATM frame headers when passing through the switch
fabric.

Fig. 6 shows Ethernet frame headers when passing through the switch
fabric.

Fig. 7 shows the scheduling and arbitration arrangements of the data
switch.

Fig. 8 shows an egress backpressure broadcast.

Fig. 9 shows the switch block diagram.

Fig. 10 shows the detail of the switch block.

Fig. 11 shows a block diagram of the master according to the
embodiment of the invention.

Fig. 12 shows a block diagram of a router according to the
embodiment of the invention, whilst

Fig. 13 shows the queue structure.



Referring now to Figure 1, this shows the general concept of a data
switch. Inputs N1 to Nn are connected to respective input ports IP1 to IPn of a data
switch SW. The switch has output ports OP1 to OPn connected to respective
outputs M1 to Mn.



With intelligent distributed scheduling mechanisms it is possible to
create an input buffered switch which meets the same traffic shaping efficiency as
its output buffered counterpart. The use of input buffers is preferred for several
reasons. Input buffering requires smaller buffers, which can have relatively low
performance and therefore be cheaper.

When cells are queued at the input there is the possibility of
contention arising through the phenomenon of Head Of Line (HOL) blocking. This
generally occurs when First In First Out (FIFO) queue mechanisms are used. The
FIFO queues the cell at the head of the queue and this is the only one that can be
chosen for delivery through the switch. Now, consider the case where an input port
has three cells c1, c2, c3 stored such that c1 is at the head of the queue with c2
stored next and c3 last, with cell c1 destined for port N and cell c2 destined for port
N+1. Now port N is already connected to port N-1, therefore c1 cannot be
switched; however, port N+1 is unconnected and therefore c2 could actually be
delivered. However, c2 cannot get out of the FIFO because it is blocked by the
HOL, i.e. c1. An intelligent approach to the solution of HOL blocking is the
concept of Virtual Output Queues (VOQ). Using VOQs the cells are separated out
at the input into queues which map directly to their required output destination.
They can therefore be effectively described as being output queues which are held
at the input, i.e. Virtual Output Queues. Since the cells are now separated out in
terms of their output destination they can no longer be blocked by the HOL
phenomenon.
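
The c1, c2, c3 example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the destination of c3 (not stated in the text) is assumed to be port N.

```python
from collections import deque

def fifo_deliverable(queue, busy_outputs):
    """Single FIFO: only the cell at the head of the queue may be chosen."""
    if queue and queue[0][1] not in busy_outputs:
        return [queue[0]]
    return []

def voq_deliverable(queue, busy_outputs):
    """Virtual output queues: one queue per destination, each with its own head."""
    voqs = {}
    for cell in queue:
        voqs.setdefault(cell[1], deque()).append(cell)
    return [q[0] for out, q in voqs.items() if out not in busy_outputs]

# Cells are (name, destination port). Port N is busy, port N+1 is free.
# The destination of c3 is an assumption made for this sketch.
cells = deque([("c1", "N"), ("c2", "N+1"), ("c3", "N")])
busy = {"N"}

print(fifo_deliverable(cells, busy))  # [] because c1 blocks c2
print(voq_deliverable(cells, busy))   # [('c2', 'N+1')]
```

With the single FIFO nothing can be delivered, whereas with per-destination queues c2 is free to cross the switch.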

There is also the question of Quality Of Service (QoS) to address.
Different input sources have different requirements in terms of how their data
should be delivered. For example voice data must be guaranteed a very tightly
controlled delivery service whereas the handling of computer data can be more
relaxed. To accommodate these requirements the concept of priority can be used.
Data is given a level of priority, which changes the way the switch deals with it.
For example consider two cells in different VOQs, c1 and c2, which are both
requesting to go to the same output. Although either could be selected only one
can be delivered. The cell with the highest priority is chosen. This decision
making process is referred to as "arbitration". It is not only priority which can be
a factor in the arbitration process. Another example would involve monitoring the
length of the VOQs and also using them as a determining factor. It should also be
noted that as switches become faster and larger then a more intelligent approach to
arbitration needs to be sought. The ideal solution is a distributed arbitration
mechanism where there exist levels of arbitration right through the switch from
the core right back to the inputs. Using such a mechanism arbitration can be very
finely tuned to cater for the most demanding quality of service requirements. By
using buffered switches the system runs the risk of losing cells, i.e. the buffer
overflows. To overcome this problem and also to efficiently size the buffers the
concept of backpressure flow control across the switch can be employed. Using
backpressure an output can inform the input that is connected to it that it is filling
too quickly and is about to lose cells. The input can now back off or slow down the
rate at which it is sending the cells and therefore reduce or completely eliminate the
risk of cell loss.
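
The effect of backpressure can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the class name, the capacity of four cells, the threshold of three and the traffic pattern (one cell offered per cycle, one cell drained every second cycle) are all hypothetical and chosen only to make overflow visible.

```python
from collections import deque

class OutputBuffer:
    """Hypothetical output buffer that raises backpressure at a fill threshold."""
    def __init__(self, capacity, threshold):
        self.capacity = capacity
        self.threshold = threshold
        self.cells = deque()

    def backpressure(self):
        # Asserted before the buffer is actually full, so the input can back off.
        return len(self.cells) >= self.threshold

    def accept(self, cell):
        if len(self.cells) >= self.capacity:
            return False  # buffer overflow: the cell is lost
        self.cells.append(cell)
        return True

def run(cycles, use_backpressure):
    """Return the number of cells lost over the given number of cycles."""
    out = OutputBuffer(capacity=4, threshold=3)
    lost = 0
    for t in range(cycles):
        # The input holds the cell back while backpressure is asserted.
        if not (use_backpressure and out.backpressure()):
            if not out.accept(t):
                lost += 1
        if t % 2 == 1 and out.cells:  # the output drains at half the input rate
            out.cells.popleft()
    return lost

print(run(20, use_backpressure=False))  # some cells are lost
print(run(20, use_backpressure=True))   # 0
```

In the second run the cells that are held back stay queued at the input rather than being dropped, which is the behaviour the paragraph above describes.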

This specification describes the implementation of a high-speed 
digital switch for use in any area in which high speed high performance digital 
communications is required. Typically this definition covers at least the Data 
Communications sector and the Cluster Computing sector. 

The embodiment of the invention shown in block diagram form in
Fig. 2 is centred on a switch fabric that is intended for use in a broad range of data
switching applications. Although the invention may be used in a variety of
applications, the rest of this description will only focus on the data communications
environment.



WO00/3837S PCT/GB99/0374S 




Referring now to Figure 2, the main feature is the data switch SW.
Inputs are provided to the switch from ingress traffic manager units ITM0 to ITMn.
Each ingress traffic manager may have one or more input line end devices (ILE)
connected to it. Outputs from the switch SW are connected by way of egress traffic
manager units ETM0 to ETMn to egress line end devices (ELE).

The traffic manager units (ITM and ETM) provide the protocol-
specific processing in the switch, such as congestion buffering, ingress traffic
policing, address translation (ingress and egress) and routing (ingress), traffic
shaping (ingress or egress), collection of traffic statistics and line level diagnostics.
There may also be some segmentation and re-assembly functionality within a
traffic manager unit. The line end devices (ILE and ELE) are full-duplex devices
and provide the switch port physical interfaces. Typically, line end devices will be
operated in synchronous transfer mode, ranging from OC-3 to OC-48 rates or
10/100 and Gigabit Ethernet.

The switch SW provides the application independent, loss-less
transport of data between the traffic managers based on routing information
provided by the traffic managers and the connection allocation policies determined
by the switch control SC. This controls the global functions of the switch such as
connection management, switch level diagnostics, statistics collection and
redundancy management.



The switching system just described is based on an input-queued non-
blocking crossbar architecture. A combination of adequate buffering, hierarchic
flow control, and distributed scheduling and arbitration processes ensures loss-less,
efficient, and high performance switching capabilities. It should be noted that the
ingress and egress functions are shown separately on either side of the drawing.
In reality, traffic manager units, line-end devices and ingress and egress ports may be
considered full duplex.



Fig. 3 shows the basic architecture of the switch according to one
embodiment of the invention. The ingress traffic manager units ITM described
above connect streams of data to a number of ingress routers SRI0 to SRIp. These
routers are connected to the switching matrix SCM, which is itself controlled by
a switch controller SM. Data outputs from the switching matrix SCM are passed
by egress routers SRE0 to SREp and on to the egress traffic manager units ETM.

The ingress routers SRI0 to SRIp on the ingress side collect data
streams from the ingress traffic manager units ITM, request connections across the
switching matrix SCM to the controller SM, queue up data packets (referred to as
"tensors") until the controller SM grants a connection and then send the data to the
switching matrix SCM. On the egress side, the egress routers SRE0 to SREp sort
data packets into the relevant data streams and forward them to the appropriate
egress traffic manager units ETM. Each ingress and egress router communicates
on a point to point basis with two traffic manager units over a common switch
interface. Each interface is 32 bits wide (full duplex) and can operate at either 50
or 100MHz. Through its common interfaces a router can support up to 5Gbs of
cell-based traffic such as ATM or 4Gbs of packet based traffic such as gigabit
Ethernet. These 4 or 5Gbs of data share a small amount of external memory.

The switch controller SM takes connection requests from the ingress
routers and creates sets of connections in the switching matrix SCM. The
controller SM arbitration mechanisms maximise the efficiency of the switching
matrix SCM while maintaining fairness of service to the routers. The controller SM is able
to configure one-to-one (unicast) and one-to-all (broadcast) connections in the
switching matrix SCM. The controller SM selects an optimal combination of
connections to establish in the matrix SCM once per switching cycle. The selection
can be postponed by one (or more) backpressure broadcast requests that are
satisfied in a round-robin fashion before allowing normal operation to resume. The
arbiter also uses a probabilistic work-conserving algorithm to allocate bandwidth
in the switching matrix to each priority according to information defined by the
external system controller.
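
Selecting a combination of one-to-one connections that satisfies as many requests as possible can be viewed as a bipartite matching problem between ingress and egress routers. The text does not say which algorithm the controller SM uses; the sketch below substitutes the textbook augmenting-path method purely for illustration, ignores priorities, fairness and bandwidth allocation, and uses hypothetical names throughout.

```python
def max_matching(requests, num_outputs):
    """requests[i] is the set of egress routers that ingress router i has cells for.
    Returns {ingress: egress} with as many one-to-one connections as possible."""
    match_of_output = [None] * num_outputs

    def try_assign(i, seen):
        for o in sorted(requests[i]):
            if o in seen:
                continue
            seen.add(o)
            # Take a free output, or re-route the ingress router currently holding it.
            if match_of_output[o] is None or try_assign(match_of_output[o], seen):
                match_of_output[o] = i
                return True
        return False

    for i in range(len(requests)):
        try_assign(i, set())
    return {i: o for o, i in enumerate(match_of_output) if i is not None}

# Ingress 0 wants egress 0 or 1, ingress 1 wants only egress 0, ingress 2 wants 1 or 2.
requests = [{0, 1}, {0}, {1, 2}]
connections = max_matching(requests, 3)
print(connections)  # {1: 0, 0: 1, 2: 2}
```

A naive first-come assignment would give egress 0 to ingress 0 and leave ingress 1 unserved; the re-routing step is what allows all three requests to be satisfied in the same switching cycle.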

The switching matrix SCM itself consists of a number of memory-less,
non-blocking matrix planes SCM1 - N and a number of embedded serial
transceivers to interface to the routers. The number of matrix planes in a particular
switch depends on the core throughput required across the matrix. The core
throughput will be greater than the aggregate of the external interfaces to allow for
inter-router communication, core header overheads and maximal connections
during the arbitration cycles. The device is packaged with two planes of sixteen
ports, which can be configured to provide an alternative number of planes/ports.
The multiple serial links that comprise the data path between the router and
switching matrix are switched simultaneously and therefore act as a single full
duplex fat pipe of 8Gbps. The switching matrix has the novel feature that it can be
configured as a 'NxN' port crossbar device where N can be 4, 8 or 16. This feature
can increase the number of planes per package and therefore allows a wide range
of systems to be realised cost effectively. For example, using the first generation
chip set, systems of less than 20Gbs up to 80Gbs can be easily configured.

Underlying the management of the system is the fabric management
interface FMI, which provides an external orthogonal interface into all of the
system devices. This level of management provides read/write access to a chosen
subset of important registers and RAMs while the device is functioning normally,
and will provide access to all the registers and RAMs in a device (using the scan
mechanism) if the system is inoperable. Management access can be used for the
purposes of system initialisation and dynamic reconfiguration. The following
features need to be configured via the fabric management interface FMI at system
reset but can be modified on a live system: ingress router queue parameter
sensitivities, ingress and egress queue thresholds, bandwidth allocation tables in the
routers and switch controller, and status information. Each device has a primary
status register, which can be read to obtain the high-level view of the device status;
for example, the detection of non-critical failures. If necessary, more detailed
status registers can then be accessed.

If a device or the total system fails, fabric interface management access into
the chip-set is still possible. This will normally provide useful information in
diagnosing the fault. It can also be used to perform low-level testing of the
hardware.



Detailed error management facilities have been built into the system. The
management of errors can be considered under the headings of detection,
correction, containment and reporting as described below:-

a) Detection. Within the system, all interfaces between devices are checked
as follows:- parallel interfaces between devices are protected by parity. Serial data
being routed from one router to another via the switching matrix is protected by a
sixteen bit cyclic redundancy code. This is generated in the ingress router and
forms part of the tensor. It is checked and discarded at the egress router. All
external interfaces support parity and the common switch interface specification
includes optional parity. This is implemented at the system end of the interface.
Error checking routines are automatically performed during system initialisation.
The FMI protocol includes parity in all of its messages.

b) Correction. If an error is detected in a tensor, because either the data is
faulty or the tensor has been misrouted, the system will not correct the error. The
tensor is discarded and it is left to a higher level of protocol to carry out any
necessary corrective actions. Where errors are detected on certain control
interfaces, retries are attempted without any external intervention in order to
distinguish between a transient and permanent failure. The fault is reported via the
FMI in either case.

c) Containment. The principle of containment is to limit the effect of an error
and, as far as possible, continue normal operation. For example if a fault is detected
in a particular tensor, that tensor is discarded but the system carries on operating
normally. Similarly, if a permanent fault is detected that affects one traffic manager
unit or router, that part of the system is disabled whilst the rest of the system
continues without a break in service. This may require system management
assistance. If redundancy were employed in the system, then at this point the
standby device(s) would become operational.

d) Reporting. All faults which allow the reporting infrastructure to continue
functioning are logged and reported to the diagnostic system. The device primary
status register has a mechanism for reporting different classes of fault separately,
so that any necessary action can be quickly determined.

e) Monitoring. In addition to error monitoring, the system contains logs to
collect performance monitoring and statistics information. These can be
dynamically accessed.

There are certain units within the system that are common to all devices.
The two units of most interest are the Central Management Unit and the Fabric
Management Interface.
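
The sixteen bit cyclic redundancy code described under Detection, and the discard-on-error behaviour described under Correction, can be sketched as follows. The text does not give the generator polynomial or initial value; the sketch assumes the widely used CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF), and the helper names are hypothetical.

```python
def crc16(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise CRC-16; polynomial and initial value are assumptions of this sketch."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def protect(payload: bytes) -> bytes:
    """Ingress side: append the CRC so that it forms part of the tensor."""
    return payload + crc16(payload).to_bytes(2, "big")

def check_and_strip(tensor: bytes):
    """Egress side: verify the CRC, then discard it. Returns None on error,
    in which case the whole tensor is dropped rather than corrected."""
    payload, received = tensor[:-2], int.from_bytes(tensor[-2:], "big")
    return payload if crc16(payload) == received else None

tensor = protect(b"cell payload")
corrupted = bytes([tensor[0] ^ 0x01]) + tensor[1:]
print(check_and_strip(tensor))     # b'cell payload'
print(check_and_strip(corrupted))  # None
```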

Data is handled throughout the system in fixed length cells. There are
several reasons for using fixed length cells, one of which is that the quality of
service (QoS) is easier to guarantee when the switch is reconfigured after every
switch cycle. In addition, the packet latency is improved for both long and short
packets and the buffer management is simplified. In practice, there are slight
variations in the format of the cells, due to the need to include steering information
in headers at various points. Figure 4 shows the flow of data through the switch
fabric and the functions performed by the seven steps shown in the diagram are
detailed below:-

Firstly, packets received from a line end are, where necessary, segmented
in an ingress traffic manager ITM and formed into cells of the correct format to be
transferred over the common interface, denoted in Figure 4 as CSIX;

Secondly, at the ingress router SRI, arriving cells are examined and placed
in the appropriate queue. There are several sets of queues, shown here as unicast
queues UQ, multicast queues MQ and broadcast queues BQ. In the diagram the
cell has been placed into one of the unicast queues;

Thirdly, the arrival of a cell triggers a 'request for transfer' RFT to the 
controller SM. The cell will be held in the queue until this request is granted; 

At step 4, the controller SM executes an arbitration process and determines 
the maximal connection set that can be established within the switching matrix 
SCM for the next switching cycle. It then grants the 'request to transfer' RTT and 
signals the egress router SRE that it must expect a cell. 

At the fifth step, the ingress router SRI, having been granted a connection,
also executes an arbitration process to determine which cell will be transferred.
The cell is transferred through the memory-less switching matrix SCM and into a
buffer in the egress router SRE.

As shown at step 6, there is one egress buffer per egress traffic manager
ETM and arriving cells are examined and placed in the appropriate traffic manager
queue in the egress router SRE.

Finally, at step 7, the cell is transferred to the egress traffic manager ETM
over the standard interface CSIX and, where necessary, re-assembled into a packet
before onward transmission.
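
The arbitration carried out by the ingress router at the fifth step can be sketched as a choice among its virtual output queues. The scoring rule below (a weighted sum of queue length and urgency, with backpressured queues skipped and ties broken in round-robin order) is an assumption made for illustration; the function name, the weights and the data layout are hypothetical and are not taken from the description.

```python
def select_voq(voqs, backpressured, last_served, w_len=1, w_urgency=1):
    """voqs maps an egress traffic manager id to (queue_length, urgency).
    Returns the id of the queue to serve, or None if no queue is eligible."""
    ids = sorted(voqs)
    # Rotate the search order so that it starts just after the queue served last.
    start = next((k for k, i in enumerate(ids) if i > last_served), 0)
    order = ids[start:] + ids[:start]
    best, best_score = None, float("-inf")
    for i in order:
        length, urgency = voqs[i]
        if length == 0 or i in backpressured:
            continue  # nothing to send, or the egress side asked us to back off
        score = w_len * length + w_urgency * urgency
        if score > best_score:  # strict comparison: ties go to the earlier in rotation
            best, best_score = i, score
    return best

voqs = {0: (3, 0), 1: (5, 2), 2: (5, 2)}
print(select_voq(voqs, backpressured=set(), last_served=1))  # 2
print(select_voq(voqs, backpressured={2}, last_served=1))    # 1
```

The cell at the head of the selected queue would then be the one sent across the switching matrix in that cycle.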

The transfer of data through the system is packaged in cells termed tensors.
An arbitration cycle transfers one tensor per router through the switching matrix
SCM. Each tensor consists of 6 or 8 vectors. A vector consists of one byte per
plane of the switching matrix and is transferred through it in one system clock
cycle. The sizes of the vector and tensor for a particular application are determined
by the bandwidth required in the fabric and the most appropriate cell size. The
following sections show the typical packaging of the data as it flows through the
system for ATM and Ethernet.



As shown in Figure 5a, illustrating the ATM application, payload cells
P containing fifty three bytes of data arriving from an ingress traffic manager ITM
across the interface CSIX are re-packaged into 60-byte tensors (6 vectors of 10
bytes). The ingress router analyses the CSIX header UH and wraps the CSIX
packet with the core header CH to create a 60-byte tensor UCT in an ingress queue.
When the controller SM grants the required connection the tensor passes through
the switching matrix SCM in one switch cycle to the egress router, which writes the
unicast tensor UT into the egress queue indicated in the core header. When the
tensor reaches the head of the egress queue, the core header is stripped off and the
remaining CSIX packet is sent to the egress traffic manager.

If the CSIX frame type indicates a multicast packet MT as shown in Figure
5b, the ingress router strips out the multicast mask MM and replicates the packet
into the indicated ingress queues, modifying the target field for each copy as
appropriate. The flow then proceeds as for unicast, except that the tensor is written
simultaneously into multiple egress buffers after passing through the switching
matrix.

In the case of Ethernet or variable length packets as shown in Figure 6, an 
ingress traffic manager ITM using segmentation and reassembly functionality 
(SAR) converts the variable length packets VLP into CSIX packets at ingress, 
embedding the SAR header in the payload. CSDC packets are then transported 
though the system m the same way as for the ATM example of Figure 5, except 
that the tensor size is set to 80 bytes (8 vectors of 10 bytes) allowing up to 70 bytes 
of Ethernet fi-ame to be carried in a single segmented packet. Note that the 
segmentation header is considered as private to the traffic managers and is shown 
for illustrative purposes only. The system treats it transparently as part of the 
payload. The CSIX interface description allows for truncated packets, that is, if a 
traffic manager sends a payload that would not fill a tensor it can send a shortened 
CSIX packet The ingress router stores the short packet in the ingress queues (on 
fixed tensor boundaries). Any part of the tensor queue that is not used is filled with 
INVALID bytes. The fixed size tensors will then have the INVALID bytes 
discarded at the egress router. 
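
The handling of truncated packets can be sketched in Python as follows. This is 
only an illustration: the encoding of an INVALID byte and the way the egress 
router learns the valid length are not given in the text, so the value 0xFF and 
the valid_len argument are assumptions, as are the function names. 

```python
VECTOR_BYTES = 10  # width of one vector, as stated in the text


def pad_to_tensor(packet: bytes, vectors: int, invalid: int = 0xFF) -> bytes:
    """Pad a short packet up to a fixed tensor boundary.

    `invalid` stands in for the INVALID byte; its real encoding is
    not specified in the text (assumption).
    """
    size = vectors * VECTOR_BYTES
    if len(packet) > size:
        raise ValueError("packet does not fit in one tensor")
    return packet + bytes([invalid]) * (size - len(packet))


def strip_invalid(tensor: bytes, valid_len: int) -> bytes:
    """Discard the padding at egress, assuming the valid length is known."""
    return tensor[:valid_len]
```

With 6 vectors a 45-byte packet occupies one 60-byte tensor, and with 8 vectors 
a full 80-byte tensor needs no padding. 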

In the system architecture, the scheduling and arbitration arrangement is 
distributed and occurs at two points: in the controller SM (between switch ports 
and between priorities) and in the router (between traffic managers) SRS/A. Figure 
7 is a conceptual diagram showing only the scheduling/arbitration functions across 
the data switch from the traffic managers TM through the common interface CSIX 
to the routers SR and the controller SCM into the switching matrix SC port. The 
diagram also shows how information on channel, link bandwidth allocation and 
switch efficiency, queue status, backpressure and traffic congestion management 
is handled by the referenced arrows. 

The controller SM provides the overall control function of the system. 
When the routers request connections from the controller, they identify their 
requested switching matrix connection by switch port and priority. The controller 
then selects combinations of connections in the switching matrix to make best use 
of the matrix connectivity and to provide fair service to the routers. This is 
accomplished by using an arbitration mechanism. The controller SM can also 
enforce pseudo-static bandwidth allocation across the priorities and ingress/egress 
switch port combinations. For example, an external system controller can 
guarantee a proportion of the available bandwidth to each of the priorities and to 
specific connections. Unused allocations will be fairly shared between other 
priorities and connections. 

The controller SM also has a 'best effort' mechanism to dynamically bias 
the arbitration in favour of long queues for applications that do not require strict 
bandwidth enforcement. 

The routers provide an aggregation function for multiple traffic managers 
into a single switch port. When the controller SM grants a connection to a 
particular egress switch port through a particular priority, the appropriate router 
must choose one of up to eight unicast and one multicast traffic manager queues 
to service. This is accomplished through a weighted round-robin mechanism, 
which can select a queue based on a combination of factors. Ingress queue length 
may allow for favouring of long queues over shorter ones, and urgency allows a 
traffic manager to temporarily increase the weighting of a queue via the urgency 
field in the CSIX header. Queue bandwidth allocation is also a factor, determined 
by the external system controller or by dynamic intervention. Finally, target 
congestion management and traffic shaping are features taken into account. The 
sensitivity of the weighting function to these parameters is determined for each 
priority and, together with the bandwidth allocations, may be altered dynamically. 

The system implements three levels of backpressure, described in more 
detail below. These are flow, traffic management and core backpressure. Flow level 
backpressure occurs between ingress and egress traffic managers for the purposes 
of congestion management and traffic shaping. Backpressure signalling at this 
level is accomplished by the traffic managers sending packets of data through the 
system and hence is beyond the scope of this document. Flow level backpressure 
packets will appear to the system to be no different than data packets and as such, 
are transparent. 



So far as traffic level backpressure is concerned, the system is organised 
to manage its data-flow at the traffic manager level of granularity (with four 
priorities). Further granularity is achieved at the traffic manager itself. An egress 
traffic manager can send backpressure information to the egress side of the router 
over the CSIX interface, multiplexed with the datastream. Since the egress side of 
the router has just a single queue per traffic manager, this is just a one-bit signal. 
Backpressure between routers is signalled via a dedicated broadcast mechanism in 
the switching controller and switching matrix. There are a number of thresholds 
in the egress buffer queues. When a threshold is crossed, the egress router signals 
the controller with a backpressure broadcast request. In the controller, such a 
request stalls the arbiter at the end of the current cycle and the controller issues a 
one vector broadcast connection to the switching matrix planes and informs the 
requesting egress router. The egress router then sends one vector's worth (10 bytes) 
of egress buffer status through the matrix to the ingress routers. The controller then 
continues the interrupted cycles. In the event of several egress routers 
simultaneously requesting a backpressure broadcast, the controller will satisfy all 
the requests in a simple round-robin manner before resuming normal service. The 
latency introduced in the backpressure mechanism due to this contention does not 
affect the egress buffering since during this period a router will only be receiving 
backpressure data from other routers, which does not need to be queued. 

An egress router will aggregate the threshold transitions from all its egress 
queues which have occurred during a switch cycle into one backpressure broadcast, 
so that the maximum number of backpressure broadcasts between two tensors is 
limited to the number of routers. When an ingress router receives a 
backpressure broadcast vector of the form shown in Figure 8, it uses it to update 
the ingress queue weightings as appropriate. 

Two modes of backpressure signalling between egress and ingress routers 
are supported, namely start/stop and multi-state signalling. Multi-state signalling 
allows the egress router to signal the multi-bit state of all its queues (1 byte per 
queue). This multi-state backpressure signalling coupled with weighted-round- 
robin scheduling in the ingress routers minimises the probability of egress queues 
being full, which is significant when attempting to forward multicast or broadcast 
traffic in a heavily utilised switch. 

The ingress router signals stop/start backpressure to the ingress traffic 
managers via the CSIX interface. This provides a 16-bit backpressure signal to 
allow the ingress router to identify the ingress queue to which the signal relates. 
Egress queue thresholds are set globally, whilst ingress queue thresholds are set 
per queue. 

The controller does not keep track of the state of the egress router buffers. 
However, core level backpressure is in place across the router/controller interface 
to prevent egress buffer overflows, by preventing the controller from scheduling 
any traffic to a particular egress router in the event that all of the respective 
router's buffers are full. 

Multicast in this system chipset is implemented through the optimal 
replication of tensors at ingress and egress. An ingress router has one multicast 
queue per egress router per priority. An ingress multicast tensor (see Figure 8) is 
created in each of the appropriate queues with the egress multicast masks in the 
target fields TM of the core headers. Each tensor, of which three are shown, has a 
length equal to 6 or 8 vectors and a width of 10 bytes. A backpressure vector BPV 
may be inserted between adjacent tensors as shown. The multicast tensors are then 
forwarded through the core in the same way as for unicast and the egress routers 
then replicate the tensors into the required egress buffers in parallel. This multicast 
mechanism is intended to provide optimum switch performance with a mix of 
unicast and multicast traffic. In particular, it maintains the efficiency and fairness 
of the scheduling and arbitration allowing the switch to provide consistent quality 
of service. 



The system provides a loss-less fabric, therefore multicast tensors cannot be 
forwarded through the switching matrix unless all the destination queues are not 
full. In a heavily utilised switch if only stop/start backpressure from the egress 
queues was implemented, this could severely restrict the useable bandwidth for 
multicast traffic. Two mechanisms are included in the system to improve its 
multicast performance. These are: 1. multi-state backpressure from the egress 
router, which reduces the probability of egress queues being full, and 2. increasing 
the weighting of the multicast ingress queues in the weighted-round-robin 
scheduler when they have been blocked to increase their chances of being 
scheduled when the block clears. To avoid multicast (and broadcast) being blocked 
by off-line egress ports, the backpressure signals can be individually masked out 
by an external system controller via the Fabric Management Interface (FMI). 
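
The loss-less forwarding rule and the masking of off-line ports can be sketched 
in Python as a bit-mask test. The representation of destinations, full queues 
and masked-out ports as integer bit masks, and the function name, are 
assumptions made for illustration; the text does not describe the actual logic. 

```python
def multicast_can_forward(dest_mask: int, full_mask: int,
                          ignore_mask: int = 0) -> bool:
    """Return True when no unmasked destination queue is full.

    dest_mask   -- bit i set: the tensor is destined for queue i
    full_mask   -- bit i set: queue i currently signals "full"
    ignore_mask -- bit i set: backpressure from queue i is masked out
                   (for example an off-line egress port)
    """
    blocking = dest_mask & full_mask & ~ignore_mask
    return blocking == 0
```

A tensor destined for queues 1 and 2 is held while queue 2 is full, and is 
released either when queue 2 drains or when its backpressure is masked out. 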

The requirement for wire-speed broadcast (benchmarking) is met by having 
a single on-chip broadcast queue in each egress router. When the controller 
schedules a broadcast connection, the tensor will be routed in the switching matrix 
to all routers in parallel, thus avoiding any ingress congestion (no tensor replication 
at ingress). Broadcast backpressure is provided by having each router inform the 
controller when it transitions in to or out of the state "all-egress-buffers-not-full". 
The controller will only schedule a broadcast when all egress buffers in all routers 
are not full. Broadcast backpressure is a configurable option. If it is not activated, 
the routers do not send status messages and the controller schedules broadcasts on 
demand. Using this method there is no guarantee that the packet will be forwarded 
on all ports. 

The switching matrix is shown in schematic form in Figure 9. It comprises 
a high-speed, edge-clocked, synchronous, 16 port dual plane serial cross-point 
switch SCN for use in the system. It has been optimised to provide a scaleable, 
high bandwidth, low latency data movement capability. It operates under the 
control of the controller SM, which sends configuration information to the matrix 
over the controller interface SMI to create connections for the transmission of data 
between routers. The buffer and decode logic BDL receives this information and 
uses it to control the interconnections within the matrix. Data is applied in serial 
form via a serial data input interface SDI and leaves via a serial data output 
interface SDO. Reset (RS) and clock (CK) signals are applied to the switch as 
necessary, as are signals to and from the fabric management interface FMI. The 
form of configuration information, passed in a number of encoded fields, 
determines which input port should be connected to which output port via the 
switching matrix. The central management unit CMU shown in Figure 10 has 
several functions, including synchronising the data transmission between the 
switching matrix and all its attached transceivers, and causing attached transceivers 
to phase shift their clocks relative to the external system clock and to maintain this 
shift during normal running, so as to optimise data reception at the switching 
matrix of data transmitted from the attached transceivers. A further function is to 
provide a reset interface to the device. 

The NxN matrix shown in Figure 10 contains a conceptual matrix of 
internal unidirectional nodes, which allow any input port to be connected to any 
output port, so that data can be transmitted from any port to any port. It is a square 
matrix, such that an n-port matrix has n^2 such nodes. At any time, each output port 
is connected to either zero or one input port. When an output port is not connected 
to an input port the data portion of that output port is always logic '0'. Each 
switching matrix SM contains two 16 port matrix planes and a full to half speed 
converter. Each matrix plane can be configured in a number of different formats 
depending on the number of ports to be attached. Possible configurations per plane 
are 1 x 16 port, 2 x 8 port or 4 x 4 port matrices. So, in total, each switching matrix 
SM may be configured as 2 planes x 16 ports, 4 planes x 8 ports or 8 planes x 4 
ports as shown in Fig. 10. The converter allows the switching matrix to support 
systems that contain a mixture of 5Gbps and 10Gbps routers. If the matrix is 
configured to operate as a 16 port device, the controller uses the entire control port 
field to connect ingress and egress ports. For 4 and 8 port configurations, the 
number of bits of the control port field required is 2 and 3 respectively. 
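
The relationship between port count, planes and control field width can be 
checked with a few lines of Python. The names are illustrative only, and the 
4-bit width of the full field is taken from the four bit source address 
mentioned elsewhere in the description. 

```python
# Sub-matrices obtained from one 16-port plane, per the configurations listed.
SPLITS_PER_PLANE = {16: 1, 8: 2, 4: 4}
PLANES_PER_DEVICE = 2  # each switching matrix contains two 16 port planes


def control_bits(ports: int) -> int:
    """Bits of the control port field needed to address `ports` inputs."""
    if ports not in SPLITS_PER_PLANE:
        raise ValueError("supported configurations are 4, 8 or 16 ports")
    return ports.bit_length() - 1  # log2 for a power of two


def total_planes(ports: int) -> int:
    """Number of independent planes per device for a given port count."""
    return SPLITS_PER_PLANE[ports] * PLANES_PER_DEVICE
```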

In operation, the switching matrix receives configuration information from 
the controller SM via the controller interface SMI. This information is loaded into, 
and stored in, configuration registers. Routing information is passed in the form of 
a number of encoded fields determining which input port is to be connected to each 
output port via the switching matrix. In a 16 x 16 matrix, there are 16 output ports. 
For each output port there is a four bit source address which is encoded to define 
which input port is to be connected to an output port. There is also an enable signal 
for each field to signal that the field is valid and a configure signal that indicates 
that the whole interface is valid. If a field is signalled as not valid, the output port 
for that field is not connected. If the configure signal is not asserted, the matrix 
does not change its current configuration. The configuration information on the 
controller/matrix interface is loaded into the device when the configure signal is 
asserted. A 16-stage programmable pipeline is used to delay the configuration 
information until it is required for switching the matrix. If there is a parity error on 
a port then that port's enable signal will be set to zero and a null tensor will be 
transmitted to the output of that port. The register that holds the parity error may 
only be loaded when the configure signal is high and is cleared when read by the 
diagnostic unit. A parity check is also carried out on the configure signal. If a 
parity error occurs here then a parity fail condition is asserted, all port enable 
signals are set to zero and all the output ports on the device will transmit null 
tensors. The connection between the routers and the matrix is via a set of serial data 
streams, each running at one Gbaud. Once a connection across the matrix has been 
set up, tensors are transmitted between ingress and egress routers. The whole 
process exhibits low latency due to a very small insertion delay. Multiple switching 
matrices can be configured in parallel to provide a highly scalable interconnect 
function. 
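
The effect of the enable and configure signals can be modelled in a few lines of 
Python. This is a behavioural sketch under stated assumptions: the class and 
method names are hypothetical, the programmable pipeline and parity checking are 
omitted, and an unconnected output is modelled as the value 0. 

```python
class CrossPoint:
    """Behavioural model of one matrix plane (illustrative only)."""

    def __init__(self, ports: int = 16):
        self.ports = ports
        # source[out] is the input port feeding output `out`, or None.
        self.source = [None] * ports

    def load(self, fields, configure: bool) -> None:
        """Load one (enable, source_address) field per output port."""
        if not configure:
            return  # configure not asserted: keep the current configuration
        if len(fields) != self.ports:
            raise ValueError("one field per output port is required")
        self.source = [src & (self.ports - 1) if enable else None
                       for enable, src in fields]

    def route(self, inputs):
        """Return the value seen on each output; unconnected outputs give 0."""
        return [inputs[s] if s is not None else 0 for s in self.source]
```

For example, with a 4-port plane, loading fields that connect output 0 to input 
2 and leave output 1 disabled gives data from input 2 on output 0 and logic 0 on 
output 1. A later load with configure not asserted leaves this unchanged. 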



Figure 11 shows the arrangement of the controller SM. The primary 
function of this is to establish and manage connections through the switching 
matrix to satisfy data movement requirements between user applications. Its 
bandwidth allocation algorithms have been designed such that bandwidth is 
allocated efficiently and fairly. The controller maintains a high throughput and 
guarantees that no queue starvation is experienced. A priority selector PSU is 
responsible for selecting which priority level of vectors is to be scheduled at any 
given time. It receives input from the router interface units SRI about the states of 
the queues at each priority level (a function of the length of each queue). Then, 
based on a bandwidth-priority allocation function built into the unit, it determines 
the priority level that should be serviced next. The bandwidth-priority function can 
be loaded during runtime using the fabric management interface FMI referred to 
above, thus allowing the controller to adjust its priority scheduling characteristics 
according to the expected load, whenever necessary. 

A scheduling and arbitration unit SAU is responsible for determining which 
set of requests presented to it is to be granted in the current routing cycle. It 
attempts to deliver a tensor to each output switch port in every arbitration cycle. 
When the logic has determined how to route the vectors across the switch fabric, 
the configuration information is passed on to the router interface SRI and the 
switching matrix interface SCI logic so that the vectors can be transferred. This unit 
can set up new configurations of unicast and broadcast connections within the 
switching matrix every 30ns, if required. Bandwidth within the switching matrix 
can be allocated on a per connection basis for applications such as ATM. 
Alternatively, the matrix is configured according to a probabilistic, 
work-conserving algorithm located in the priority selector unit PSU. 

A router interface unit SIU is provided for every router in the system. 
Each instance provides the functionality described below. The controller SM 
monitors the number of tensors in each of the ingress router queues (each router 
has separate queues for each system destination port, together with a multicast 
queue, at each of four priority levels). The monitoring is done using a pair of 
tightly coupled state machines, one in the router and the other in the controller. For 
small numbers of vectors in a queue, the controller keeps an exact count of the 
number of vectors. The router notifies the controller when new vectors are added 
to a queue and the controller decrements the queue size when it schedules one of 
the vectors in the queue. When there are a larger number of vectors in a queue, the 
controller keeps only an approximate (fuzzy) count of the queue size and is 
informed by the router when the queue size crosses predefined boundaries. This 
minimises the amount of state information that needs to be stored and processed in 
the controller. 
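
The router side of this exact/approximate counting scheme can be sketched in 
Python as a rule for when an arrival is reported to the controller. The limit of 
six for exact counting follows the "about six or seven" figure given later in 
the description; the boundary values and the function name are hypothetical, and 
only arrivals (not grants) are modelled. 

```python
EXACT_LIMIT = 6             # exact counting up to this size (assumption)
BOUNDARIES = (16, 32, 64)   # hypothetical predefined boundaries


def notify_on_arrival(count_before: int) -> bool:
    """Decide whether the router reports this arrival to the controller."""
    count_after = count_before + 1
    if count_after <= EXACT_LIMIT:
        return True                   # small queue: every arrival is reported
    return count_after in BOUNDARIES  # large queue: only boundary crossings
```

Under these assumptions, filling an empty queue with 70 vectors generates nine 
notifications rather than 70, which illustrates the reduction in state traffic. 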

The central management unit CMU is common to all devices. Its functions 
are to provide the Fabric Management Interface FMI between each device and an 
external controller, control error management within the device and provide a reset 
interface RS and reference clocking CK in to each device. 

In operation, the controller SM receives requests for connections from the 
routers over the controller/router interface SRI. As the connection requests arrive, 
they are queued at the router interface SIU. Since several routers can be requesting 
connections simultaneously, the controller provides scheduling and arbitration 
logic to maximise connection efficiency and to ensure that all ports receive a fair 
level of service, depending on their level of priority. The router interface unit SIU 
presents requests for each non-empty queue to the scheduling and arbitration unit 
SAU, which determines which tensors can be routed in any given switch cycle. 
The scheduling and arbitration unit SAU attempts to deliver a tensor to each router 
in every switch cycle. The arbiter also uses a work-conserving algorithm, located 
in the priority selection unit PSU, to allocate bandwidth in the switching matrix to 
each priority according to information defined by the external system controller. 
Bandwidth can also be allocated on a per connection basis. A typical use for this 
mechanism would be an ATM 'Connection Admission Control' function 
dynamically changing the bandwidth allocation. 

When the scheduling and arbitration unit SAU creates the requested 
connection, the associated pair of routers are notified that the connection is to be 
made and the controller sends the relevant connection control information to the 
switching matrix to establish the required connection. This is done continuously in 
a series of switching cycles, where each cycle involves three key steps: determining 
which connections to set up, setting up the connections and then transferring the 
vectors. These steps are interleaved so as to keep the switching cycle time as small 
as possible and the throughput of the resulting fabric as high as possible. The 
switching cycle time is a multiple of the system clock. The number of system 
clock cycles per switching cycle affects the operation of the router interface and 
switching matrix interface. Egress routers can send backpressure to ingress routers 
via a dedicated broadcast mechanism. Backpressure requests, received across the 
controller/router interface from egress routers, are serviced before normal 
connection selections in the scheduling and arbitration unit SAU. The 
backpressure broadcasts are then serviced in a round robin fashion before allowing 
normal operation to continue. 

There is also another mechanism in the controller/router interface referred 
to as 'core level backpressure', which prevents the controller from scheduling any 
traffic to a particular egress router. A router uses core level backpressure when all 
its egress buffers are full. 



The controller is capable of establishing both unicast and broadcast 
connections in the switching matrix. It is also capable of dealing with system 
configurations that contain a mixture of 'full' and 'half speed' ports, for example 
a mixture of 10Gbit/sec and 5Gbit/sec routers. 

Figure 12 shows a router device. This is a system port interface control 
device. Its main function is to support user applications' data movement 
requirements by providing access into and out of the system. There are two 
instances of the ingress interface unit IIU, one for each of the traffic managers that 
can be connected to a system port. The IIU is responsible for transferring data from 
a traffic manager into an internal FIFO queue on the router and informing the ICU 
that it has tensors ready to transmit into the system. The external interface to the 
traffic manager utilises common system interface CSIX. This defines an n x 8-bit 
data bus; the ingress interface units IIU operate in a 32-bit mode. The FIFO is four 
tensors deep to allow one tensor to be transferred to the ICU while subsequent ones 
are being received. 



To generate the tensors, the ingress interface unit appends a three byte 
system core header to the CSIX frame prior to passing it on and indicating the 
tensor's availability to the ICU. The IIU examines the CSIX header to determine 
whether the frame is of type unicast, multicast or broadcast and indicates the type 
to the ICU. If the frame is unicast, the IIU sets a single bit in byte 1 indicating the 
destination TM; this is derived from the destination address in the CSIX header. If 
the frame is multicast, a tensor is constructed and sent for each of the 16 system 
ports that have a non-zero CSIX mask. In the case of a broadcast CSIX frame, byte 
1 is set to all 1s by the IIU. The IIU is also responsible for calculating the two byte 
Tensor Error Check which utilises a cyclic redundancy check. 

Traffic manager flow control is provided by making each ingress interface 
unit IIU responsible for signalling traffic manager start/stop backpressure 
information to its associated egress interface unit EIU. The IIU obtains this 
backpressure information by decoding the CSIX control bus. If parity error 
checking has been enabled (the appropriate bit in a status register is set) and the IIU 
detects a parity error on CSIX, then an error is logged and the corresponding tensor 
discarded. This log can be retrieved via the FMI. 

The ingress control unit ICU is responsible for accepting tensors from the 
ingress interface units IIUs, making connection requests to the controller interface 
unit SMIU, storing tensors until the controller grants a connection and then sending 
tensors to transceivers TXR. There are two types of connection requests (and 
subsequent grants). One is used for all unicast/multicast tensors and the second is 
used for broadcast traffic. For unicast/multicast tensors the ingress control 
unit/controller interface unit signalling incorporates the system destination port and 
priority. Clearly for broadcast tensors there is no requirement for a system 
destination address and since there is only one level of broadcast tensors, a priority 
identifier is also not required. 

Ingress buffering is illustrated in Figure 13. This buffering for unicast 
queuing UQ is implemented such that there is one for each possible destination 
traffic manager and priority. In addition to unicast queues, there is a multicast 
queue MQ per port per priority and a single broadcast queue BQ. The queues are 
statically allocated. There are 512 unicast, 64 multicast and one broadcast queue. 
The unicast and multicast queues are located in external SRAM. The queue 
organisation allows flow control down to OC-12 granularity. Within the unicast 
address field of the CSIX header, 3 bits are allocated for the number of traffic 
managers a router can support. Since the router supports two traffic managers, the 
spare bit field is used for a function known as Service Channel. Service Channels 
provide the means of fully exploiting the router's implicit OC-12 granularity 
features. 
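
The queue totals quoted above are consistent with the following arithmetic, 
written here in Python. The decomposition into 16 system ports, 8 addressable 
targets per port (from the 3 address bits) and 4 priorities is inferred from 
figures given elsewhere in the description rather than stated as a formula, and 
the figure of four service channels per traffic manager is likewise an 
inference from the two spare address bits. 

```python
SYSTEM_PORTS = 16      # system ports, as referred to in the description
TM_ADDRESS_BITS = 3    # bits allocated in the unicast address field
PRIORITIES = 4         # priority levels
TRAFFIC_MANAGERS = 2   # traffic managers supported per router

targets_per_port = 2 ** TM_ADDRESS_BITS                    # 8
unicast_queues = SYSTEM_PORTS * targets_per_port * PRIORITIES
multicast_queues = SYSTEM_PORTS * PRIORITIES
broadcast_queues = 1

# Inference: addresses left over per traffic manager act as service channels.
service_channels_per_tm = targets_per_port // TRAFFIC_MANAGERS
```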



When the ingress control unit ICU receives a connection grant signal from 
the controller interface unit SMIU (which specifies egress port and priority), the 
ICU must choose one of up to 8 qualifying unicast queues or the multicast queue 
from which to forward a tensor. This is achieved using a weighted round-robin 
mechanism that takes into account several parameters. One is the ingress queue 
length, which allows for the favouring of longer queues over shorter ones and 
another is aggregate queue tensor urgency, which allows a traffic manager to 
temporarily increase the weighting of a queue via the urgency field in the CSIX 
header. One further parameter taken into account is queue bandwidth allocation, 
whereby an external system controller or system operator can configure the system 
to provide bandwidth allocation to individual flows via the FMI. The final 
parameter considered is that of target egress queue backpressure. The effective 
performance of the multicast scheme requires that the probability of 
egress queues being full be minimised. The sensitivity of the weighting function 
to the input variables is controlled by four sets of global sensitivity variables (one 
per priority). These settings are configured at system initialisation. 
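
A minimal Python sketch of such a queue selection follows. The description does 
not give the weighting function, so the linear combination of the four 
parameters, the field names and the credit-based form of weighted round-robin 
used here are all assumptions made for illustration. 

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    length: int            # tensors waiting in the ingress queue
    urgency: int = 0       # from the urgency field of the CSIX header
    allocation: int = 0    # configured bandwidth share
    backpressure: int = 0  # state of the target egress queue (higher = fuller)
    credit: float = 0.0    # running round-robin credit


@dataclass
class Sensitivity:
    """One set of sensitivity variables; the text has one set per priority."""
    length: float
    urgency: float
    allocation: float
    backpressure: float


def weight(q: Queue, s: Sensitivity) -> float:
    # Hypothetical weighting: backpressure lowers the weight, the rest raise it.
    w = (s.length * q.length + s.urgency * q.urgency
         + s.allocation * q.allocation - s.backpressure * q.backpressure)
    return max(w, 0.0)


def select(queues, s: Sensitivity):
    """Pick the next queue to service, or None if all queues are empty."""
    eligible = [q for q in queues if q.length > 0]
    if not eligible:
        return None
    total = 0.0
    for q in eligible:
        w = weight(q, s)
        q.credit += w
        total += w
    chosen = max(eligible, key=lambda q: q.credit)
    chosen.credit -= total   # charge the serviced queue
    chosen.length -= 1       # one tensor forwarded
    return chosen
```

With only the bandwidth allocation made significant, two queues with 
allocations 3 and 1 are serviced in a 3:1 ratio over any four grants. 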



To provide an ingress flow control mechanism, the ingress control unit ICU 
implements three watermark levels to indicate the state of the queues (fairly empty, 
filling up, fairly full or very full). The watermarks have associated hysteresis and 
both values are configurable via the FMI. When a queue moves from one state to 
another, the ICU signals the change to each of the egress interface units EIUs. In 
addition to this 'multistate' backpressure mechanism it is also possible to invoke 
a second mode of backpressure signalling that involves only start/stop signalling. 
The backpressure mechanism mode is selected via the FMI by setting the 
watermark levels appropriately. 
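
Three watermarks with hysteresis yield four queue states, which can be sketched 
in Python as below. The class name, the use of a single hysteresis value for all 
watermarks and the example levels are assumptions; the description only states 
that the watermark levels and hysteresis are configurable. 

```python
STATES = ("fairly empty", "filling up", "fairly full", "very full")


class WatermarkTracker:
    """Track a queue's state against three ascending watermarks."""

    def __init__(self, marks, hysteresis: int):
        self.marks = tuple(marks)  # three ascending fill levels
        self.hyst = hysteresis     # margin below a mark before stepping down
        self.state = 0             # index into STATES

    def update(self, fill: int) -> bool:
        """Update the state for a new fill level; True if the state changed."""
        old = self.state
        # Step up while the fill level has reached the next watermark.
        while self.state < len(self.marks) and fill >= self.marks[self.state]:
            self.state += 1
        # Step down only once the fill level is clearly below the last mark.
        while self.state > 0 and fill < self.marks[self.state - 1] - self.hyst:
            self.state -= 1
        return self.state != old
```

A True result corresponds to the point at which a state change would be 
signalled. The hysteresis prevents repeated signalling when the fill level 
hovers around a watermark. 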

The egress control unit ECU signals egress backpressure information to the 
ICU. This information relates either to the signalling egress router's buffers or to 
information the ECU has received about the state of another egress router's buffers. 
If the information relates to the signalling egress router's buffers the ICU updates 
the backpressure status used by the ingress scheduling algorithm and makes a 
request to send backpressure information to the controller interface unit SMIU. If 
it relates to another egress router's buffers then the ICU simply updates its own 
backpressure status. 

There are two instances of the egress interface unit EIU, one for each of the 
traffic managers that are connected to a system port. The egress interface unit is 
responsible for accepting tensors from the ECU and transmitting them as frames 
over CSIX to the associated traffic manager. 

To provide traffic manager flow control, the egress interface unit EIU 
accepts traffic manager start/stop backpressure information from its associated 
ingress interface unit IIU (that is, the one connected to the same traffic manager). 
If the EIU is currently sending a frame to the traffic manager, then it continues the 
transfer of the current frame and then waits until a start indication is received 
before transferring any subsequent frames. 

To provide ingress flow control the EIU accepts ingress buffer multistate 
backpressure information from the ICU and sends it immediately to the traffic 
manager. 

The egress control unit ECU is responsible for accepting tensors from the 
serial transceivers, when informed by the controller interface unit SMIU of their 
imminent arrival, and forwarding them to the relevant EIU. The ECU examines the 
traffic manager mask byte of the system core header to determine the correct 
destination EIU. In the case of multicast (or broadcast) tensors, multiple bits are 
set in the mask and the tensors are simultaneously transferred to all the EIUs for 
which a corresponding bit is set. This feature provides wire speed multicasting at 
the egress router. The ECU is responsible for checking the tensor error check bytes 
of the system core header. If the system core error checking has been enabled (i.e. 
the appropriate bit in a status register is set) and the ECU detects an error, then it 
is logged and the corresponding tensor discarded. To provide an egress flow 
control mechanism the ECU implements three watermark levels to indicate the 
state of the egress buffers (fairly empty, filling up, fairly full or very full). When 
an egress buffer moves from one state to another the ECU signals the change to the 
ICU. The level of the watermarks is configurable via the FMI. In addition to this 
multistate backpressure mechanism it is also possible to invoke a second mode of 
backpressure signalling that involves only start/stop signalling. The type of 
backpressure mechanism is selected via the FMI by setting the watermark levels 
appropriately. 
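
The selection of destination EIUs from the traffic manager mask byte can be 
sketched in Python as follows. The mapping of bit i of the mask to EIU i and the 
function name are assumptions; the description states only that a set bit 
selects the corresponding EIU and that there are two EIUs per router. 

```python
def destination_eius(tm_mask: int, eiu_count: int = 2):
    """Return the indices of the EIUs selected by the mask byte.

    Bit i of `tm_mask` is assumed to select EIU i. Several set bits
    model a multicast or broadcast tensor delivered to all of them.
    """
    return [i for i in range(eiu_count) if (tm_mask >> i) & 1]
```

A unicast tensor has a single bit set and reaches one EIU, whereas a mask with 
both low bits set delivers the tensor to both EIUs at once. 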

The controller interface unit SMIU is responsible for controlling the 
interface to the controller. Since the controller operates at the system port rather 
than the traffic manager port level of granularity, the SMIU also operates at this 
level. The SMIU maintains a count of the number of tensors in the ingress queues 
associated with each destination system port. The count is incremented each time 
the SMIU is informed of a tensor arrival by the ICU and decremented each time the 
SMIU receives a grant from the controller. 

The controller interface unit SMIU contains a state machine that is tightly 
coupled to a corresponding one in the controller. For small numbers of tensors 
(less than about six or seven), the SMIU notifies the controller of each new tensor 
arrival. For larger numbers of tensors, the SMIU only informs the controller when 
the count value crosses predefined boundaries. 

The central management unit is common to all devices. Its functions are 
to provide an FMI between each device and an external controller, control error 
management within the device and provide a reset interface and reference clocking 
in to each device. 

Referring back to Figure 12, the routers provide access into the system via 
CSIX ingress and egress interfaces. On receiving a CSIX packet from the ingress 
traffic manager, over the CSIX interface ICSIX, the ingress interface unit IIU
checks the type and validity of the packet. The packet is then wrapped with a core
header, the contents of which vary with the packet type. When the core header has
been appended, the packet becomes known as a tensor. The ingress control unit
ICU makes a request to the controller through the controller interface SMI for a
connection across the switching matrix and stores the tensors until the connection
is created. In order to eliminate head of line blocking for unicast traffic, ingress
buffering is organised into separate queues, one for each possible destination traffic
manager TMQ1 to TMQN and priority P1 to P4 as shown in Fig. 13. Individual
queues per priority are not required to avoid head of line blocking but are
advantageous as they allow the controller to enforce bandwidth allocation to each
priority in the switch. In addition to the unicast queues there is a multicast queue
per port per priority and a single broadcast queue. The unicast and multicast
queues are statically allocated in external SRAM. The purpose of this level of
buffering is to allow the controller to allocate connections efficiently by giving it
a view of the ingress datastreams and to provide rate matching between the router 
external interfaces and the router/matrix interface. 
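
The way separate queues per destination and priority avoid head of line blocking can be shown with a small sketch. The Python below is illustrative only: the class `IngressQueues` and its method names are hypothetical, tensors are represented by plain strings, and the multicast and broadcast queues are omitted for brevity.

```python
from collections import deque
from typing import Deque, Dict, Optional, Set, Tuple

NUM_PRIORITIES = 4  # P1 to P4 in the text


class IngressQueues:
    """One FIFO per (destination traffic manager, priority) pair."""

    def __init__(self, num_destinations: int):
        self.queues: Dict[Tuple[int, int], Deque[str]] = {
            (dest, prio): deque()
            for dest in range(num_destinations)
            for prio in range(1, NUM_PRIORITIES + 1)
        }

    def enqueue(self, dest: int, prio: int, tensor: str) -> None:
        """Store a tensor in the queue for its destination and priority."""
        self.queues[(dest, prio)].append(tensor)

    def dequeue_for(self, dest: int, prio: int) -> Optional[str]:
        """Take the head tensor for a granted (destination, priority), if any."""
        q = self.queues[(dest, prio)]
        return q.popleft() if q else None

    def pending(self) -> Set[Tuple[int, int]]:
        """Queues that currently hold tensors, i.e. the view offered to the controller."""
        return {key for key, q in self.queues.items() if q}
```

A tensor waiting for a busy destination sits only at the head of its own queue. If a tensor for destination 0 arrives before one for destination 1 and only destination 1 is granted, `dequeue_for(1, 1)` still returns the second tensor immediately.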

When connections are granted, the controller creates a connection across 
the switching matrix to the requested egress router at a given priority. The ingress 
control unit ICU must now choose one of the qualifying unicast or multicast queues 
from which to forward a tensor to the transceiver for serialisation. This level of 
router scheduling is done on a weighted round-robin basis. Each unicast and
multicast queue has a weighting associated with it, which is determined by the
backpressure from the egress buffers, the queue length, the queue urgency and the
static bandwidth allocation. On the egress side the controller informs the router
of a tensor's imminent arrival. The egress control unit ECU receives this tensor and
examines the core header to see which traffic manager to send the tensor to.
Tensors are then assembled back into datastreams and forwarded via CSIX to the
appropriate traffic manager. 
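
The weighted round-robin selection among qualifying queues can be sketched as below. The text does not say which round-robin variant is used or how the four factors are combined into a weighting, so this sketch makes two assumptions: it uses the common "smooth" weighted round-robin variant, and it treats each queue's weighting as an integer that has already been computed elsewhere. The class name `WeightedRoundRobin` is hypothetical.

```python
from typing import Dict, Hashable, Iterable, Optional


class WeightedRoundRobin:
    """Smooth weighted round-robin over a set of queue identifiers."""

    def __init__(self, weights: Dict[Hashable, int]):
        self.weights = dict(weights)
        self.current = {key: 0 for key in weights}

    def select(self, eligible: Iterable[Hashable]) -> Optional[Hashable]:
        """Pick one eligible queue; higher weights are picked proportionally more often."""
        candidates = [k for k in eligible if self.weights.get(k, 0) > 0]
        if not candidates:
            return None
        total = sum(self.weights[k] for k in candidates)
        # Every candidate earns credit equal to its weight on each selection.
        for k in candidates:
            self.current[k] += self.weights[k]
        # The candidate with the most accumulated credit wins and pays the total.
        chosen = max(candidates, key=lambda k: self.current[k])
        self.current[chosen] -= total
        return chosen
```

With weights of 3 and 1 for two queues that are always eligible, the first queue is selected three times in every four selections, and the selections for the lighter queue are interleaved rather than bunched at the end of a round.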

Multicasting in the system is achieved by the optimal replication of tensors 
at the ingress and egress. On the ingress side a router has one multicast queue per
egress router at each priority. Multicast routing information is appended on the
ingress side and on arrival at the egress side these masks determine the replication
of tensors into the required egress buffers. Broadcast in the system is achieved by
having a single on-chip broadcast queue at the ingress of each router. When the
controller schedules a broadcast connection, the tensor will be routed by the matrix
to all egress routers in parallel, thus avoiding any ingress congestion.
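
Mask-driven replication at the egress side can be sketched as follows. The encoding is an assumption made for illustration: bit i of the mask is taken to select egress buffer i, and the function name `replicate_by_mask` is hypothetical. The text does not define the mask format.

```python
from typing import Dict, List


def replicate_by_mask(tensor: bytes, mask: int,
                      egress_buffers: Dict[int, List[bytes]]) -> List[int]:
    """Copy a tensor into every egress buffer whose bit is set in the mask.

    Bit i of the mask selects egress buffer i (an assumed encoding).
    Returns the list of buffer indices that received a copy.
    """
    targets = []
    for index in sorted(egress_buffers):
        if mask & (1 << index):
            egress_buffers[index].append(tensor)
            targets.append(index)
    return targets
```

For example, a mask of `0b1010` applied to four buffers places one copy of the tensor in buffers 1 and 3 and leaves buffers 0 and 2 untouched.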

There are four interfaces to the router which are used in normal operation.
These are the controller interface, the switching matrix interface, multiple CSIX
interfaces and the fabric management interface FMI.

The open standard common switch interface (CSIX) provides for the 
transfer of data and control information between a traffic manager and a switch 
fabric and provides the system with a level of protocol independence. The actual
operation of CSIX is fundamentally quite simple. A traffic manager is required to
compile a 4-byte CSIX header, which includes information such as frame type,
destination port, priority and urgency. Urgency is a concept that allows a
particular queue in the router to have one of its priorities temporarily favoured in
order for it to have a greater chance of being scheduled next. This is one of the
features that assists traffic shaping and buffer optimisation and generally helps to
maintain a high quality of service. Each CSIX interface is a point-to-point,
bi-directional link between the router and a traffic manager. A single CSIX interface
supports one traffic manager up to OC-12. A number of CSIX data paths can be
grouped together to support higher bandwidth traffic managers whilst using a
single control path. Each CSIX data path is a multiple of 8 bits wide in each
direction (Tx and Rx).

The fabric management interface FMI is implemented as a bit serial
interface to/from each system device. It operates at 25 MHz and uses a proprietary
protocol. The FMI performs a number of functions. It is the primary interface for
system control of the chip-set. It allows a switch manager to read run time status
information. It also allows all the devices to be updated dynamically with
information required for bandwidth allocation. The FMI also provides access for
system establishment and initialisation.

Each system device contains a logic block known as the fabric management
interface unit (FMIU). The FMIU interfaces to the functional logic, also known as
the core, within the device in order to provide run-time (read/write) access to a 
chosen subset of the registers and RAM locations, a mechanism to report run-time 
fail conditions detected by the device, and scan access (read/write) to the total set 
of registers in the functional logic while the functional logic is not operational. 

The external interface to the fabric management interface unit FMIU 
requires a number of inputs, including a Hard Reset input which sets the system 
device into a known state. In particular, it sets the device into a state where the 
FMIU is fully functional and the serial interface can be used. Hard Reset is 
expected to be applied when power is first applied to the device, and may also be
applied at other times. The external interface also has serial input and serial output
lines and a device locator address field used to identify a particular instance of a
device. The device locator field is generated by tie-offs that are determined by the
device's physical position in the system.

The main functions of the central management unit (CMU) shown in 
Figure 12 include error detection and logging logic. This is responsible for 
detecting error conditions and states within the chip or on its interfaces. As such, 
its functionality is spread throughout the design and is not concentrated within a
specific block. Errors are reported and stored in the Error and Status registers and 
logs, which are accessible across the FMI. The CMU also has reset and clock 
generation logic responsible for the generation and distribution of clocks and reset 
signals within the device. In addition, the CMU contains test control logic which 
controls the mechanisms built in for chip test. The target fault coverage is 99.9%.
This logic is not used under normal operating conditions. The final function of the
CMU is to provide fabric management logic common to all of the system devices
which provides access to error logs and configuration data from an external
controller. It also provides the functionality for device scan test and PCB testing. 



In summary, the central management unit provides access to device testing, 
system establishment and error and status reporting over the system. 

CLAIMS

1. A method of handling packets of information through a data switch
comprising input traffic controllers, ingress routers, a memoryless cyclic switch
fabric, egress routers and output traffic controllers all under the control of a switch
controller and interconnected such that each input line connected to the data switch
is terminated on a traffic controller arranged to convert the input line protocol
information packets into fixed length cells having a header defining the data switch
destination router and output traffic controller together with message priority
information arranged such that each ingress router serves a group of traffic
controllers characterised in that the ingress router includes a set of input buffers
one for each input line and a set of virtual output queue buffers, one for each
output traffic controller from the data switch, and in which the method comprises
on the arrival of a cell from a traffic controller the ingress router examines the cell
header and places it in the appropriate virtual output queue and generates a request
for transfer message consisting of the destination traffic controller address and a
message priority code which is passed to the data switch controller, the switch
controller schedules the passage of the cells across the switch fabric by
interconnecting a specific ingress router to a specific egress router for each switch
fabric cycle in accordance with a first arbitration process the ingress router
selecting from the appropriate virtual output queue the cell at the head of the queue
for passage across the data switch to the appropriate output traffic controller in
accordance with a second arbitration process.

2. A method of handling packets of information through a data switch as 
claimed in claim 1 characterised in that the ingress buffering is organised into
separate queues, one for each destination traffic controller and each priority level. 

3. A method of handling packets of information through a data switch as
claimed in claim 1 or 2 characterised in that the ingress router uses a weighted
round-robin arbitration process to select the next queue buffer based upon ingress
queue length, aggregate queue packet urgency and target traffic controller egress
queue backpressure.

4. A method of handling packets of information through a data switch as
claimed in claim 1, 2 or 3 characterised in that the first arbitration process involves
determining the set of requests to be accepted for each switch fabric cycle by
attempting to deliver a packet of information to each output switch fabric port in
every arbitration cycle.

5. A data switch for handling packets of information comprising input traffic
controllers, ingress routers, a memoryless cyclic switch fabric, egress routers and
output traffic controllers all under the control of a switch controller and
interconnected such that each input line connected to the data switch is terminated
on a traffic controller arranged to convert the input line protocol information
packets into fixed length cells having a header defining the data switch destination
router and output traffic controller together with message priority information
arranged such that each ingress router serves a group of traffic controllers
characterised in that the ingress router includes a set of input buffers one for each
input line and a set of virtual output queue buffers, one for each output traffic
controller connected to the data switch, and in which on the arrival of a cell from
a traffic controller the ingress router examines the cell header and places it in the
appropriate virtual output queue and generates a request for transfer message
consisting of the destination traffic controller address and a message priority code
which is passed to the data switch controller, the switch controller schedules the
passage of the cells across the switch fabric by interconnecting a specific ingress
router to a specific egress router for each switch fabric cycle in accordance with a
first arbitration process and the ingress router selects from the appropriate virtual
output queue the cell at the head of the queue for passage across the data switch to
the appropriate output traffic controller in accordance with a second arbitration
process.

6. A data switch for handling packets of information as claimed in claim 5 
characterised in that the virtual output queues are arranged as separate queues one
for each destination traffic controller and each priority level.

7. A data switch for handling packets of information as claimed in claim 5 or 
6 characterised in that the ingress router uses a weighted round-robin mechanism
to select the next queue buffer based on ingress queue length, aggregate queue
packet urgency and target traffic controller egress queue backpressure.

8. A data switch for handling packets of information as claimed in claim 5,
6 or 7 characterised in that the switch controller performs a first arbitration process
which involves determining the set of requests to be accepted for each switch fabric
cycle by attempting to deliver a packet of information to each output switch fabric 
port in every arbitration cycle. 

9. A method of handling packets of information through a switch as described
and shown in the accompanying drawings.



10. A data switch for handling packets of information as described and shown
in the accompanying drawings. 